Storage system

ABSTRACT

A storage group configured by a plurality of storage devices is configured by a plurality of storage sub-groups, and the respective storage sub-groups are configured from two or more storage devices. A sub-group storage area, which is the storage area of the respective storage sub-groups, is configured by a plurality of rows of sub-storage areas. A data set, which is configured by a plurality of data elements configuring a data unit, and a second redundancy code created on the basis of this data unit, is written to a row of sub-storage areas, a compressed redundancy code is created on the basis of two or more first redundancy codes respectively created based on two or more data units of two or more storage sub-groups, and this compressed redundancy code is written to a nonvolatile storage area that differs from the above-mentioned two or more storage sub-groups.

CROSS-REFERENCE TO RELATED APPLICATIONS

The present application is a continuation of application Ser. No. 12/060,966, filed Apr. 2, 2008, now U.S. Pat. No. 7,970,995; which relates to and claims the benefit of priority from Japanese Patent Application number 2008-24504, filed on Feb. 4, 2008, the entire disclosure of which is incorporated herein by reference.

BACKGROUND OF THE INVENTION

The present invention generally relates to a storage system.

Generally speaking, a storage system, in which technology called RAID (Redundant Arrays of Independent (or Inexpensive) Disks) is employed, comprises a RAID group configured from a plurality of storage devices. A RAID group storage area is configured from a plurality of rows of sub-storage areas. The respective rows of sub-storage areas extend across the plurality of storage devices configuring the RAID group, and are configured from a plurality of storage areas corresponding to the plurality of storage devices. Hereinafter, one sub-storage area will be called a "stripe," and one row configured from a plurality of stripes will be called a "row of stripes". The RAID group storage area is made up of consecutive rows of stripes.

A RAID is known to have a number of levels (referred to as "RAID levels" hereinafter).

For example, there is RAID 5. In RAID 5, data is distributively written to a plurality of storage devices (for example, hard disk drives (HDD)), which configure the RAID group corresponding to RAID 5. More specifically, for example, write-targeted data specified by a host computer corresponding to RAID 5 is divided into data of a prescribed size (hereinafter, for the sake of convenience, referred to as a "data unit"), each data unit is divided into a plurality of data elements, and the plurality of data elements is written to a plurality of stripes. Further, in RAID 5, in order to restore a data element that can no longer be read out from a storage device due to this storage device having failed, redundancy data called "parity" (hereinafter, "redundancy code") is created for a single data unit, and this redundancy code is also written to a stripe. More specifically, for example, when there are four storage devices configuring a RAID group, three data elements, which configure a data unit, are written to three stripes corresponding to three storage devices from thereamong, and a redundancy code is written to the stripe corresponding to the one remaining storage device. If a failure should occur in one of the four storage devices configuring the RAID group, the unreadable data element is restored by using the remaining two data elements configuring the data unit, which comprises this unreadable data element, and the redundancy code corresponding to this data unit.
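
The single-parity mechanism described above can be illustrated with a short sketch. The following Python fragment is not taken from the patent; the element values and the helper function are hypothetical, and it merely demonstrates, assuming bytewise XOR parity, how one redundancy code is derived from a data unit and how a single unreadable data element is recovered from the remaining elements and that code.

    # Hypothetical sketch of RAID 5-style single-parity protection.
    def xor_blocks(blocks):
        """Bytewise XOR of equally sized byte strings."""
        result = bytearray(len(blocks[0]))
        for block in blocks:
            for i, b in enumerate(block):
                result[i] ^= b
        return bytes(result)

    # A data unit divided into three data elements (one stripe per data HDD).
    d0, d1, d2 = b"AAAA", b"BBBB", b"CCCC"

    # The redundancy code ("parity") written to the stripe of the fourth HDD.
    p = xor_blocks([d0, d1, d2])

    # If one data element (say d1) becomes unreadable, it is restored from
    # the remaining data elements and the redundancy code.
    restored_d1 = xor_blocks([d0, d2, p])
    assert restored_d1 == d1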

The problem with RAID 5 is that it is unable to tolerate a so-called double-failure. More specifically, when two elements (either two data elements, or one data element and a redundancy code) in a data set, which is configured by a data unit and a redundancy code, become unreadable due to the fact that two of the plurality of storage devices configuring the RAID group have failed, these two elements cannot be restored. This is because only one redundancy code is created for each data unit.

RAID 6 is the RAID level that is capable of tolerating a double-failure like this. In RAID 6, two (two types of) redundancy codes are created for each row of stripes (Intelligent RAID 6 Theory Overview and Implementation).

However, while RAID 6 has the advantage of being able to tolerate a double-failure, it has the disadvantage of requiring more storage capacity than RAID 5 for a single data unit. This is because more redundancy codes are written for a single data unit than in RAID 5.

SUMMARY OF THE INVENTION

Accordingly, an object of the present invention is to make it possible both to restore two elements, which have become unreadable in a single data set, and to conserve consumed storage capacity.

In a first aspect, a storage group, which is configured by a plurality of storage devices, is configured from a plurality of storage sub-groups, and the respective storage sub-groups are configured from two or more storage devices of a plurality of storage devices. The plurality of storage sub-groups are configured from a plurality of first type storage sub-groups and a plurality of second type storage sub-groups. Two or more storage devices, which configure the respective second type storage sub-group, are storage devices constituting the respective components of the plurality of first type storage sub-groups, and therefore, the respective storage devices, which configure a storage group, constitute the components of both any of the plurality of first type storage sub-groups and any of the plurality of second type storage sub-groups. The respective first type sub-group storage areas, which are the respective storage areas of the respective first type storage sub-groups, are configured from a plurality of rows of first type sub-storage areas. A row of first type sub-storage areas spans two or more storage devices configuring the first type storage sub-group, and is configured from two or more first type sub-storage areas corresponding to these two or more storage devices. Respective second type sub-group storage areas, which are the respective storage areas of the plurality of second type storage sub-groups, are configured from the plurality of rows of second type sub-storage areas. The row of second type sub-storage areas spans two or more storage devices, which configure a second type storage sub-group, and is configured from two or more second type sub-storage areas corresponding to these two or more storage devices.

In a second aspect, a storage group, which is configured by a plurality of storage devices, is configured from a plurality of storage sub-groups, and the respective storage sub-groups are configured by two or more storage devices of the plurality of storage devices. A sub-group storage area, which is the storage area of the respective storage sub-groups, is configured from a plurality of rows of sub-storage areas. The respective rows of sub-storage areas span the above-mentioned two or more storage devices, and are configured from a plurality of sub-storage areas corresponding to the above-mentioned plurality of storage devices. In this configuration, a data set, which is configured from a plurality of data elements configuring a data unit and a second redundancy code, which is created on the basis of the above-mentioned data unit, is written to a row of sub-storage areas, and a compressed redundancy code, which is one code of a size that is smaller than the total size of two or more first redundancy codes, is created on the basis of two or more first redundancy codes respectively created based on two or more data units in two or more storage sub-groups of the plurality of storage sub-groups, and this compressed redundancy code is written to a nonvolatile storage area that differs from the above-mentioned two or more storage sub-groups.

In a third aspect, a storage group, which is configured by a plurality of storage devices, is configured from a plurality of storage sub-groups. The respective storage sub-groups are configured by two or more storage devices of the plurality of storage devices. When a data set, comprising a data unit and a redundancy code created on the basis of this data unit, is written to a certain storage sub-group, a different type redundancy code related to this data unit is written to a storage sub-group that differs from this storage sub-group. When a multiple-failure data set, which is a data set comprising a first and second element that are impossible to read out, exists in the above-mentioned certain storage sub-group, the first and second elements are restored by making use of the above-mentioned different type redundancy code that exists in the above-mentioned different storage sub-group.

BRIEF DESCRIPTION OF THE DRAWINGS

FIG. 1 is a diagram showing the physical configuration of a storage system related to a first embodiment of the present invention;

FIG. 2 is a diagram showing an example of the logical configuration of the storage system 1 related to this embodiment;

FIG. 3 is a diagram showing an example of the relationships between a plurality of HDD and logical volumes configuring a RAID group;

FIG. 4 is a diagram showing an example of the configuration of a RAID configuration table;

FIG. 5 is a diagram showing an example of the configuration of a VDEV configuration table;

FIG. 6 is a diagram showing an example of the configuration of an LU configuration table;

FIG. 7 is a diagram showing an example of the configuration of a Disk Group configuration table;

FIG. 8 is a diagram showing an example of the configuration of an HDD failure-check table;

FIG. 9 is a diagram showing an example of the configuration of a stripe failure-check table;

FIG. 10 is a schematic diagram of a write in a first data protection mode;

FIG. 11 is a schematic diagram of a restore in the first data protection mode;

FIG. 12 is a schematic diagram of a write in a second data protection mode;

FIG. 13 is a schematic diagram of a restore in the second data protection mode;

FIG. 14 is a schematic diagram of a write in a third data protection mode;

FIG. 15 is a schematic diagram of a restore in the third data protection mode;

FIG. 16 is a schematic diagram of a write in a fourth data protection mode;

FIG. 17 is a schematic diagram of a restore in the fourth data protection mode;

FIG. 18 is a flowchart of the process carried out by a command processor when the storage system receives an I/O request from the host;

FIG. 19 is a flowchart of an I/O process based on the first data protection mode;

FIG. 20 is a flowchart of a restore process in the first data protection mode;

FIG. 21 is a flowchart of processing corresponding to S304 of FIG. 20;

FIG. 22 is a flowchart of I/O processing based on the second and third data protection modes;

FIG. 23 is a flowchart of processing corresponding to S2208 of FIG. 22;

FIG. 24 is a flowchart of restore processing in the second and third data protection modes;

FIG. 25 is a flowchart of processing corresponding to S2407 of FIG. 24;

FIG. 26 is a flowchart of a process for restoring an HDD in which vertical parity is stored, in the second data protection mode;

FIG. 27 is a flowchart of an I/O process based on the fourth data protection mode;

FIG. 28 is a flowchart of processing corresponding to S2708 of FIG. 27;

FIG. 29 is a flowchart of a restore process in the fourth data protection mode;

FIG. 30 is a flowchart of processing corresponding to S2909 of FIG. 29;

FIG. 31 is a diagram showing another example of the relationships between the plurality of HDD and the logical volumes configuring the RAID group;

FIG. 32A is a schematic diagram related to the amount of data comprising the RAID group in RAID 6; and

FIG. 32B shows the type of HDD in the case of a RAID group for which a horizontal RAID group and a vertical RAID group both correspond to RAID 4.

DESCRIPTION OF THE PREFERRED EMBODIMENT

In Embodiment 1, a storage system comprises a storage group configured from a plurality of storage devices; and a write controller that controls a write to the above-mentioned storage group. The above-mentioned storage group is configured from a plurality of storage sub-groups. The respective storage sub-groups are configured from two or more storage devices of the above-mentioned plurality of storage devices. The above-mentioned plurality of storage sub-groups is configured from a plurality of first type storage sub-groups and a plurality of second type storage sub-groups. The two or more storage devices, which configure the respective second type storage sub-groups, are storage devices, which respectively constitute the components of the above-mentioned plurality of first type storage sub-groups, and therefore, the respective storage devices configuring the above-mentioned storage group constitute the components of both any of the plurality of first type storage sub-groups, and any of the plurality of second type storage sub-groups. The respective first type sub-group storage areas, which are the respective storage areas of the above-mentioned respective first type storage sub-groups, are configured from a plurality of rows of first type sub-storage areas. The above-mentioned row of first type sub-storage areas spans the above-mentioned two or more storage devices configuring the above-mentioned first type storage sub-group, and is configured from two or more first type sub-storage areas corresponding to these two or more storage devices. The respective second type sub-group storage areas, which are the respective storage areas of the above-mentioned respective second type storage sub-groups, are configured from the plurality of rows of second type sub-storage areas. The above-mentioned row of second type sub-storage areas spans the above-mentioned two or more storage devices, which configure the above-mentioned second type storage sub-group, and is configured from two or more second type sub-storage areas corresponding to these two or more storage devices.

In Embodiment 1, for example, the storage group is the total RAID group, which will be explained hereinbelow, the first type storage sub-group is a horizontal RAID group, which will be explained hereinbelow, the row of first type sub-storage areas is a horizontal row of stripes, which will be explained hereinbelow, the first redundancy code is a horizontal parity, which will be explained hereinbelow, the second type storage sub-group is a vertical RAID group, which will be explained hereinbelow, the row of second type sub-storage areas is a vertical row of stripes, which will be explained hereinbelow, and the second redundancy code is a vertical parity, which will be explained hereinbelow.

In Embodiment 2 according to Embodiment 1, as the data unit, which is data of a prescribed size, and which is made up of a plurality of elements, there are a first type data unit and a second type data unit. The above-mentioned first type data unit is configured from a plurality of data elements. The above-mentioned second type data unit either is configured from one data element of each of the plurality of first type data units, or is configured from a plurality of first redundancy codes. The above-mentioned write controller: (W1) writes a data set, which comprises a plurality of data elements configuring the above-mentioned first type data unit and a first redundancy code created based on the above-mentioned plurality of data elements, to the above-mentioned row of first type sub-storage areas; and (W2) writes a second redundancy code, which is created based on the above-mentioned second type data unit residing in the row of second type sub-storage areas, to a free second type sub-storage area in this row of second type sub-storage areas.

In Embodiment 3 according to Embodiment 2, the above-mentioned storage system further comprises a restore controller that controls a restore to the above-mentioned storage group. When a multiple-failure data set, which is a data set comprising an unreadable first and second element, exists in the above-mentioned storage group, the above-mentioned restore controller: (R1) restores the above-mentioned first element in the second data unit, which comprises the above-mentioned first element, based on a data element other than the above-mentioned first element and a second redundancy code created on the basis of this second data unit; and (R2) restores the above-mentioned second element in the second data unit, which comprises the above-mentioned second element, either based on a data element other than the above-mentioned second element and a second redundancy code created on the basis of this second data unit, or based on the restored first element and an element other than the above-mentioned first element in the above-mentioned data set.

The above-mentioned first and second elements are either the two above-mentioned data elements, or one of the above-mentioned data elements and the above-mentioned first redundancy code.

In Embodiment 4 according to any of Embodiments 1 through 3, the above-mentioned respective second type storage sub-groups are RAID groups corresponding to RAID 4.

In Embodiment 5 according to any of Embodiments 1 through 3, the above-mentioned respective second type storage sub-groups are RAID groups corresponding to RAID 5.

In Embodiment 6 according to either Embodiment 4 or Embodiment 5, the above-mentioned respective first type storage sub-groups are RAID groups corresponding to RAID 5.

In Embodiment 7 according to any of Embodiments 2 through 6, the above-mentioned write controller updates the second redundancy code corresponding to the second data unit comprising an updated data element, and/or the second redundancy code corresponding to the second data unit comprising an updated first redundancy code, asynchronously to the timing at which a data element in a certain first data unit and, together therewith, the first redundancy code corresponding to the above-mentioned certain first data unit have been updated.

In Embodiment 8 according to any of Embodiments 4 through 7, two or more logical addresses respectively corresponding to two or more first type sub-storage areas configuring a Pth row of first type sub-storage areas inside a Kth first type storage sub-group, and two or more logical addresses respectively corresponding to two or more first type sub-storage areas configuring a Pth row of first type sub-storage areas inside a K+1st first type storage sub-group, are consecutive (K and P being integers greater than 0).

In Embodiment 8, the storage system comprises a storage group, which is configured from a plurality of storage devices, and a write controller that writes a data unit, which is data of a prescribed size, to the above-mentioned storage group. The above-mentioned storage group is configured from a plurality of storage sub-groups. The respective storage sub-groups are configured from two or more storage devices of the above-mentioned plurality of storage devices. The sub-group storage area, which is the storage area of the above-mentioned respective storage sub-groups, is configured from a plurality of rows of sub-storage areas. The respective rows of sub-storage areas span the above-mentioned two or more storage devices, and are configured from a plurality of sub-storage areas corresponding to the above-mentioned plurality of storage devices. The above-mentioned write controller: (W1) writes a data set, which is configured from a plurality of data elements configuring a data unit and a second redundancy code created on the basis of the above-mentioned data unit, to a row of sub-storage areas; (W2) creates a compressed redundancy code, which is one code of a size that is smaller than the total size of two or more first redundancy codes, on the basis of the above-mentioned two or more first redundancy codes respectively created based on two or more data units in two or more storage sub-groups of the above-mentioned plurality of storage sub-groups; and (W3) writes the above-mentioned created compressed redundancy code to a nonvolatile storage area that differs from the above-mentioned two or more storage sub-groups.

In Embodiment 8, for example, the storage group is a total RAID group, which will be explained hereinbelow, the storage sub-group is a part of the RAID group, which will be explained hereinbelow, the first redundancy code is a P parity, which will be explained hereinbelow, and the second redundancy code is a Q parity, which will be explained hereinbelow.

In Embodiment 9, the storage system further comprises a restore controller that restores a data element configuring a data unit stored in the above-mentioned storage group. When a multiple-failure data set, which is a data set comprising unreadable first and second elements, exists in the above-mentioned storage group, the above-mentioned restore controller: (R1) reads out the above-mentioned compressed redundancy code from the above-mentioned nonvolatile storage area, and restores a first redundancy code created on the basis of a data unit in the above-mentioned multiple-failure data set, based on the above-mentioned read-out compressed redundancy code and either one or a plurality of first redundancy codes, which constitute the basis of the above-mentioned compressed redundancy code, created on the basis of either one or a plurality of data units in either one or a plurality of data sets other than the above-mentioned multiple-failure data set; and (R2) restores the above-mentioned first and second elements, based on the above-mentioned restored first redundancy code and an element other than the above-mentioned first and second elements in the above-mentioned multiple-failure data set. The above-mentioned first and second elements either are the two above-mentioned data elements, or are one of the above-mentioned data elements and the above-mentioned second redundancy code.

In Embodiment 10 according to either Embodiment 8 or Embodiment 9, two or more logical addresses respectively corresponding to two or more sub-storage areas configuring a Pth row of sub-storage areas in a Kth storage sub-group, and two or more logical addresses respectively corresponding to two or more sub-storage areas configuring a Pth row of sub-storage areas in a K+1st storage sub-group, are consecutive for the above-mentioned two or more storage sub-groups (K and P being integers greater than 0).

In Embodiment 11 according to any of Embodiments 8 through 10, the above-mentioned nonvolatile storage area is a sub-storage area inside a storage sub-group that differs from the above-mentioned two or more storage sub-groups of the above-mentioned plurality of storage sub-groups.

In Embodiment 12, the storage system comprises a storage group configured from a plurality of storage devices; a write controller that writes a data unit, which is data of a prescribed size, and a redundancy code created on the basis of this data unit; and a restore controller that controls a restore to the above-mentioned storage group. The above-mentioned storage group is configured from a plurality of storage sub-groups. The respective storage sub-groups are configured from two or more storage devices of the above-mentioned plurality of storage devices. When the above-mentioned write controller writes a data set, comprising a data unit and a redundancy code created on the basis of this data unit, to a certain storage sub-group, the above-mentioned write controller writes a different type redundancy code related to this data unit to a storage sub-group that differs from this storage sub-group. When a multiple-failure data set, which is a data set comprising unreadable first and second elements, exists in the above-mentioned certain storage sub-group, the above-mentioned restore controller uses the above-mentioned different type redundancy code that exists in the above-mentioned different storage sub-group to restore the above-mentioned first and second elements.

In Embodiment 12, for example, the storage group can be the total RAID group, which will be explained hereinbelow, the storage sub-group can be either a horizontal RAID group or a vertical RAID group, which will be explained hereinbelow, and the row of sub-storage areas can be a row of horizontal stripes, which will be explained hereinbelow, or a row of vertical stripes, which will be explained hereinbelow. Or, for example, the storage group can be the total RAID group, which will be explained hereinbelow, the storage sub-group can be a part of the RAID group, which will be explained hereinbelow, and the first redundancy code can be a P parity, which will be explained hereinbelow.

At least one of the above-mentioned write controller and restore controller can be constructed using hardware (for example, a circuit), a computer program, or a combination thereof (for example, one part can be realized via a computer program, and the remaining part or parts can be realized via hardware). The computer program is executed by being read into a prescribed processor. Further, a storage region that exists in a hardware resource, like a memory, can be used as needed for information processing, which is carried out by a computer program being read into the processor. Further, the computer program can be installed in a computer from a CD-ROM or other such recording medium, and can also be downloaded to the computer via a communication network.

The first embodiment of the present invention will be explained in detail hereinbelow while referring to the figures. Furthermore, in the following explanation, respective data of a prescribed size needed to create a redundancy code will be referred to as a "data unit", and data, which is a component of a data unit, and which is stored in a stripe, will be referred to as a "data element". Further, in the following explanation, a storage group will be called a "RAID group", and it will be supposed that the respective storage devices configuring a RAID group are HDD (hard disk drives).

FIG. 1 is a diagram showing the physical configuration of a storage system 1 related to the first embodiment of the present invention.

One or more host computers (hereinafter, host) 4 and the storage system 1 are connected via a FC (Fibre Channel) switch 5. In this figure, the host 4 and the storage system 1 are connected via a single FC switch 5, but the host 4 and the storage system 1 can also be connected via a plurality of FC switches 5. Furthermore, a SAN (Storage Area Network) is constructed by one or more FC switches 5. The FC switch 5 and the host 4, and the FC switch 5 and the host adapter 11 of the storage system 1, are respectively connected by fibre channel cables. The host 4 can send a data I/O request (for example, a read request or a write request) to the storage system 1 by way of the FC switch 5.

The storage system 1, for example, can be made into a RAID system comprising a large number of HDD (hard disk drives) 16 arranged in an array. The storage system 1, for example, comprises a CHA (channel adapter) 11, a DKA (disk adapter) 13, cache/control memory 14, and internal switch 15 as its controller. Access to the HDD 16 is controlled by the storage system 1 controller. Furthermore, for example, the storage system 1 can also be realized by equipping the FC switch 5 with the functions of the CHA 11, DKA 13, and internal switch 15, and combining the FC switch 5 with a plurality of HDD 16.

The CHA 11 has either one or a plurality of I/F (for example, a communication port, or a communication control circuit comprising a communication port) 113, communicably connected to an external device (for example, a host or other storage system), and is for carrying out data communications with the external device. The CHA 11 is configured as a microcomputer system (for example, a circuit board) comprising a CPU 111 and a memory 112. The CHA 11, for example, writes write-targeted data to the cache area of the cache/control memory 14 when there is a write request from the host 4. Further, the CHA 11 sends read-targeted data, which the DKA 13 has read out from the HDD 16 and written to the cache area of the cache/control memory 14, to the host 4 when there is a read request from the host 4.

The DKA 13 has either one or a plurality of drive I/F (for example, a communication port or a communication control circuit comprising a communication port) 133 communicably connected to the respective HDD 16, and is for carrying out communications with the HDD 16. The DKA 13 is configured as a microcomputer system (for example, a circuit board) comprising a CPU 131 and memory 132. The DKA 13, for example, writes write-targeted data, which has been written to the cache area of the cache/control memory 14 from the CHA 11, to the HDD 16, and writes read-targeted data read out from the HDD 16 to the cache area.

Further, the DKA 13 comprises a parity create module 134 that creates a redundancy code (hereinafter, parity) for restoring a data element, which has become impossible to read out due to a failure that has occurred in the HDD. In this embodiment, the parity create module 134 is a hardware circuit for creating parity, but it can also be a function incorporated into a computer program. The parity create module 134, for example, creates parity by computing an exclusive OR for the plurality of data elements configuring a data unit (or by applying a prescribed coefficient to the plurality of data elements making up the data unit, and subsequently computing the exclusive OR for the respective data). Further, the parity create module 134 can create one parity based on a plurality of parities (hereinafter, called a "compressed parity"). In this embodiment, as will be explained hereinbelow, the way of creating parity will differ according to which data protection mode, of any of a first through a fourth data protection mode, is protecting a write-targeted VDEV (VDEV will be explained hereinbelow). In the first data protection mode, one compressed parity is created for each HDD. More specifically, one compressed parity is created for each HDD on the basis of a plurality of data elements and a plurality of first parities (hereinafter, P parities) stored in a plurality of stripes configuring an HDD. In the fourth data protection mode, one compressed parity is created on the basis of a plurality of P parities corresponding to a plurality of data units.

The cache/control memory 14, for example, is either a volatile or a nonvolatile memory. The cache/control memory 14 has a cache area and a control area. The cache/control memory 14 can be configured from two memories: a memory having a cache area, and a memory having a control area. The cache area temporarily stores data received from an external device (host 4), and data read out from the HDD 16. The control area stores information (hereinafter, control information) related to the control of the storage system 1. Control information, for example, can include a variety of tables, which will be explained hereinbelow.

The internal switch 15, for example, is a crossbar switch, and is the device that interconnects the CHA 11, DKA 13, and cache/control memory 14. Instead of the internal switch 15, another type of connector, such as a bus, can be used.

A management terminal 6, for example, is connected to the internal switch 15. The management terminal 6 is a computer for managing the storage system 1. The management terminal 6, for example, can store a variety of tables, which will be explained hereinbelow, in the control area of the cache/control memory 14. Furthermore, the functions carried out by the management terminal 6 can be provided in the host 4. That is, the host 4 can store the various tables that will be explained hereinbelow.

The preceding is an explanation of an example of the physical configuration of the storage system related to the first embodiment. Furthermore, the above explanation is only an example, and the configuration of this storage system is not limited to it. For example, the controller can have a simpler configuration, and, for example, can be configured comprising a CPU and a memory on one circuit board (a configuration by which the functions of the CHA 11 and DKA 13 are realized via a single circuit board).

FIG. 2 is a diagram showing an example of the logical configuration of the storage system 1 related to this embodiment.

In the CHA 11, a command processor 201, for example, is stored in the memory 112 as a computer program to be executed by the CPU 111. In the DKA 13, a disk I/O processor 202 and a logical/physical converter 203, for example, are stored in the memory 132 as programs to be executed by the CPU 131. Hereinafter, whenever a computer program is the subject of an explanation, this will signify that the processing is actually being carried out by the CPU, which executes this computer program.

The command processor 201 processes an I/O request received from the host 4. For example, when an I/O request is a write request, the command processor 201 writes the write-targeted data accompanying this write request to the cache area.

The logical/physical converter 203 converts a logical address to a physical address. A logical address, for example, is an LDEV identifier or the LBA (Logical Block Address) of this LDEV. A physical address is an LBA used for specifying the location of the respective disk blocks inside an HDD 16, or a "combination of a cylinder number, track number and sector number (CC, HH, SS)".

The disk I/O processor 202 controls the input/output of data to/from the HDD 16. More specifically, for example, the disk I/O processor 202 divides write-targeted data stored in the cache area into a plurality of data units, and writes the respective data units to the RAID group. At this time, the disk I/O processor 202 uses the logical/physical converter 203 to convert the logical address of the access destination to a physical address, and sends the physical address-specifying I/O request to an HDD 16. Consequently, the data element and parity can be written to the storage area corresponding to this physical address, and the data element and parity can be read from the storage area corresponding to this physical address.

FIG. 3 is a diagram showing an example of the relationships between a plurality of HDD 16 and logical volumes.

A single RAID group is configured from a plurality of (for example, four) HDD 16-1, 16-2, 16-3 and 16-4. For example, when the RAID level is RAID 5, three data elements configuring a data unit are stored in three HDD 16, and a P parity created on the basis of these three data elements is stored in one other HDD 16.

In this embodiment, the storage area provided by this one RAID group or plurality of RAID groups (a cluster of storage areas of a plurality of HDD 16) is called a "VDEV", which is an abbreviation for Virtual Device. In this example, one VDEV corresponds to one RAID group. The respective VDEV parts obtained by partitioning this VDEV are called logical volumes in this embodiment. A logical volume is specified from the host 4, and is identified inside the storage system 1 as well. Accordingly, hereinafter a logical volume specified from the host 4 may be called an "LU" (Logical Unit), and a logical volume identified inside the storage system 1 may be called an "LDEV" (Logical Device). In the example of this diagram, three LDEV are created from one VDEV, but the number of LDEV can be either more or less than this (for example, there can be one LDEV for one VDEV).

A VDEV is configured from a plurality of rows of stripes. The respective rows of stripes are configured by four stripes corresponding to the four HDD 16-1, 16-2, 16-3 and 16-4. The HDD 16 storage area is partitioned into a plurality of prescribed size sub-storage areas (that is, stripes). A data element write or parity write is carried out in units of one stripe.

FIG. 31 is a diagram showing another example illustrating the relationships between the plurality of HDD 16 and the logical volumes. In the example explained using FIG. 3, one VDEV corresponded to one RAID group, but in this example, one VDEV corresponds to a plurality of RAID groups.

In this figure, one RAID group 0 is configured from a plurality of (for example, four) HDD 16-0-0, 16-0-1, 16-0-2, and 16-0-3. The same holds true for RAID group 1 and RAID group 2. In this example, one VDEV corresponds to a plurality of (for example, three) RAID groups. Similar to the example of FIG. 3, the respective parts of the VDEV obtained by partitioning this VDEV are LDEV. According to the example shown in FIG. 31, the respective LDEV span a plurality of (for example, three) RAID groups.

In this figure, the symbols inside the HDD 16 (0, 1, 2, . . . and P0, P1, P2, . . . ) are identifiers for uniquely identifying data elements and parities inside the VDEV. These consecutive data elements are stored in one LDEV. Unlike the example of FIG. 3, since one VDEV is configured from a plurality of RAID groups in this example, logically consecutive data elements can be arranged across a plurality of RAID groups. Arranging these data elements like this makes it possible to enhance access performance by distributing access to the one LDEV among the plurality of RAID groups.
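
The arrangement can be illustrated with a rough sketch. The round-robin placement rule, the function name, and the group sizes below are assumptions of this example (the exact ordering shown in FIG. 31 is not reproduced); the sketch only shows how consecutive logical stripes of one LDEV could be spread over three RAID 5 groups so that sequential access is distributed.

    RAID_GROUPS = 3          # RAID groups 0 through 2, as in the example
    DATA_HDDS_PER_GROUP = 3  # data stripes per row of stripes (parity excluded)

    def locate(logical_stripe):
        """Map a VDEV-relative logical stripe number to
        (raid_group, data_hdd_within_group, row_within_group)."""
        row_global, offset = divmod(logical_stripe, DATA_HDDS_PER_GROUP)
        raid_group = row_global % RAID_GROUPS
        row_within_group = row_global // RAID_GROUPS
        return raid_group, offset, row_within_group

    # Consecutive stripes 0 through 8 touch all three RAID groups:
    # [locate(n) for n in range(9)]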

Various tables comprising control information stored in the cache/control memory 14 will be explained hereinbelow by referring to FIGS. 4 through 9.

FIG. 4 is a diagram showing an example of the configuration of a RAID configuration table 400.

The RAID configuration table 400 is for managing the RAID configurations of the respective VDEV. More specifically, for example, this table 400 has a column 401 in which VDEV identification numbers are written, a column 402 in which HDD identification numbers are written, a column 403 in which RAID levels are written, and a column 404 in which stripe sizes (stripe storage capacities) are written. That is, in this table 400, a VDEV identification number, a plurality of HDD identification numbers configuring the relevant VDEV, the RAID level of the relevant VDEV, and a stripe size are written for each VDEV.

FIG. 5 is a diagram showing an example of the configuration of a VDEV configuration table 500.

The VDEV configuration table 500 is for managing the configuration of a VDEV. More specifically, for example, this table 500 has a column 501 in which VDEV identification numbers are written, a column 502 in which LDEV identification numbers are written, a column 503 in which the LDEV start address of a range of logical addresses in the VDEV is written, and a column 504 in which the LDEV end address of a range of logical addresses in the VDEV is written. That is, the LDEV identification number that exists in a specific range of logical addresses of a specific VDEV is written in this table 500.

FIG. 6 is a diagram showing an example of the configuration of an LU configuration table 600.

The LU configuration table 600 manages the respective LU configurations. More specifically, for example, this table 600 has a column 601 in which LDEV identification numbers are written, a column 602 in which WWN (World Wide Names) are written, a column 603 in which LUN (Logical Unit Numbers) are written, and a column 604 in which LDEV storage capacities are written. That is, in this table 600, an LDEV identification number, a WWN and LUN corresponding to the relevant LDEV, and the storage capacity of this LDEV are written for each LU.

In this embodiment, a logical volume specified from the host 4 is referred to as "LU" as described hereinabove, and more specifically, for example, a logical volume correspondent to a WWN and LUN in the Fibre Channel protocol is referred to as LU. Furthermore, for example, the WWN and LUN columns 602 and 603 need not be provided for a mainframe.

FIG. 7 is a diagram showing an example of the configuration of a Disk Group configuration table 700.

The Disk Group configuration table 700 is used when the HDD specified in column 402 of the RAID configuration table 400 are divided up and managed in a plurality of groups. More specifically, for example, this table has a column 701 in which VDEV identification numbers are written, a column 702 in which numbers uniquely identifying the groups inside the VDEV are written, a column 703 in which the identification numbers of the HDD configuring the respective groups are written, and a column 704 in which the RAID levels of the relevant groups are written. That is, in this table 700, which group of which VDEV is configured from which HDD, and what RAID level each relevant group has, are written.

In this embodiment, columns that show RAID levels exist in two tables, the RAID configuration table 400 and the Disk Group configuration table 700. As such, it is possible to express a special RAID level that connotes a plurality of other RAID levels inside a certain RAID group. Furthermore, a specific example of the data configured in the Disk Group configuration table 700 will be described hereinbelow.

FIG. 8 is a diagram showing an example of the configuration of an HDD failure-check table 800.

The HDD failure-check table 800 is for checking an HDD 16 in which a failure has occurred. More specifically, for example, this table 800 has a column 801 in which HDD identification numbers are written, and a column 802 in which a flag for identifying whether or not a failure has occurred is written. That is, an HDD 16 in which a failure has occurred is written in this table 800.

FIG. 9 is a diagram showing an example of the configuration of a stripe failure-check table 900.

The stripe failure-check table 900 is for checking a stripe (for convenience of explanation, a "failed stripe"), of the stripes comprising the failed HDD 16, from which a data element cannot be read out. This table 900 is prepared for all of the HDD 16 configuring a RAID group. More specifically, for example, this table 900 has a column 901 in which stripe identification numbers are written, and a column 902 in which flags for identifying whether or not a failure has occurred are written. That is, a stripe, which is a failed stripe, is written in this table 900. Furthermore, in this figure, the identification number assigned for uniquely specifying the stripe comprised in this HDD 16 is written in column 901 for each HDD 16, but like the identification numbers assigned in FIG. 3, the identification numbers assigned for uniquely specifying the stripes comprised inside a RAID group can also be written in column 901.
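
For illustration only, the control tables described above might be represented in memory roughly as follows. The field names and sample values are assumptions of this sketch; they merely mirror the columns listed in FIGS. 4 through 9.

    # Hypothetical in-memory counterparts of the control tables.
    raid_configuration = [
        # VDEV number, member HDD numbers, RAID level, stripe size
        {"vdev": 0, "hdds": [0, 1, 2, 3], "raid_level": "RAID5", "stripe_kb": 64},
    ]

    vdev_configuration = [
        # Which LDEV occupies which logical address range of the VDEV.
        {"vdev": 0, "ldev": 0, "start_lba": 0, "end_lba": 0x0FFFFF},
    ]

    lu_configuration = [
        # How a host-visible LU maps onto an LDEV.
        {"ldev": 0, "wwn": "50:06:0E:80:00:00:00:01", "lun": 0, "capacity_gb": 100},
    ]

    hdd_failure_check = {0: False, 1: False, 2: True, 3: False}  # True = failed HDD

    stripe_failure_check = {        # per HDD: which stripes are failed stripes
        2: {7: True},               # stripe 7 of HDD 2 cannot be read out
    }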

The preceding is an explanation of the various tables.

The storage system 1 related to this embodiment employs striping, which distributively writes data to the plurality of HDD 16 configuring a RAID group, the same as RAID 5 or RAID 6, but uses a method that differs from RAID 5 and RAID 6 to create and write parity. Hereinbelow, the layout of data elements and parities, and a restore carried out based on this layout, will be called a "data protection mode". The storage system 1 related to this embodiment can employ four types of data protection modes.

Any of these four types of data protection modes can restore a data set which has suffered a double-failure. More specifically, for example, even when an HDD from which data elements cannot be read out from any stripe (hereinafter, a completely failed HDD) and an HDD having both a stripe from which a data element can be read out and a stripe from which a data element cannot be read out (hereinafter, a partially failed HDD) exist in a single RAID group, and there consequently exists double-failure data, which is a data set (a set comprising a data unit and a redundancy code) comprising two unreadable data elements, it is possible to restore these two data elements. Furthermore, in the second and third data protection modes, even when there are two completely failed HDD in a single horizontal RAID group (will be explained hereinbelow), the data inside these completely failed HDD can be restored. Further, in the fourth data protection mode, even when there are two completely failed HDD in a single partial RAID group (will be explained hereinbelow), the data inside these completely failed HDD can be restored.

In the first and fourth data protection modes, a compressed parity, which is smaller in size than the total size of a plurality of parities corresponding to a plurality of data units on which the compressed parity is based, is created and stored, thereby consuming less storage capacity than in RAID 6, which always creates two parities for one data unit. In the second and third data protection modes, parities, which are based on data elements inside data units of a plurality (a sufficiently large number is required, and this number will be explained hereinbelow) of horizontal RAID groups (will be explained hereinbelow), are created and stored, thereby consuming less storage capacity than in RAID 6, which always creates two parities for one data unit.
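
A rough, illustrative comparison of the consumed parity capacity in the first data protection mode, assuming a RAID group of N HDD with M rows of stripes (both values are hypothetical, and the description quantifies the individual modes in detail later), is sketched below.

    N, M = 4, 1000                    # 4 HDD, 1000 rows of stripes (assumed)

    raid6_parity_stripes = 2 * M      # RAID 6: two parities per row of stripes
    mode1_parity_stripes = 1 * M + N  # first mode: one P parity per row plus
                                      # one compressed parity per HDD

    # For M much larger than N this is roughly half the parity capacity of
    # RAID 6 (here: 2000 versus 1004 stripes).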

These four types of data protection modes will be explained in detail hereinbelow. Further, the writing of data and parity to the HDD 16 is carried out by the disk I/O processor 202, which is executed by the CPU 131 of the DKA 13. Further, for both the first and fourth data protection modes, a stripe in which a compressed parity is stored will be called a "specified stripe" for convenience sake. Conversely, a stripe, which constitutes the write destination of data elements and parity, will be called a "normal stripe". The above-mentioned "failed stripe" is a normal stripe from which it is impossible to read out a data element or parity.

FIG. 10 is a schematic diagram of a write in the first data protection mode.

First, the manner in which data elements and parity are arranged in the HDD 16 in the first data protection mode will be explained.

In the first data protection mode, a compressed parity created from the data elements and P parities of all the stripes comprised in an HDD 16 is written for each HDD 16. This compressed parity is created by computing the exclusive OR of the data elements and P parities stored in all the normal stripes corresponding to the HDD 16 (or by applying a prescribed coefficient to these data elements and P parities, and subsequently computing the exclusive OR therefor). The locations of the stripes in which the compressed parities are written (that is, the specified stripes), for example, can be at the tail ends of the respective HDD 16 as shown in the figure. In other words, the row of stripes at the tail end of the VDEV is a row of specified stripes configured by four specified stripes. Furthermore, a specified stripe can be a stripe other than that at the tail end of an HDD 16.
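
The per-HDD compressed parity can be sketched as follows, assuming bytewise XOR as the parity operation; the stripe contents are dummy values standing in for the data elements and P parities stored in one HDD's normal stripes.

    from functools import reduce

    def bxor(a, b):
        """Bytewise XOR of two equally sized byte strings."""
        return bytes(x ^ y for x, y in zip(a, b))

    # Contents of every normal stripe of one HDD (data elements and P parities).
    hdd_normal_stripes = [b"\x10" * 8, b"\x22" * 8, b"\x34" * 8, b"\x0f" * 8]

    # One compressed parity per HDD, written to that HDD's specified stripe
    # (and also held in the cache area).
    compressed_parity = reduce(bxor, hdd_normal_stripes)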

In this embodiment, four compressed parities respectively corresponding to the four HDD 16-1 through 16-4 reside in the cache area of the cache/control memory 14, and are written from the cache area to the specified stripes in a timely manner. For example, when a compressed parity is updated by a data write carried out upon receipt of a write request, this updated compressed parity can be written to both an HDD 16 and the cache area upon being updated, or it can be written solely to the cache area without being written to an HDD 16. In the latter case, at a subsequent prescribed timing, at least one of the four compressed parities written to the cache area (for example, either all of the compressed parities or only the updated compressed parity) is copied to a specified stripe in the HDD 16. By so doing, the time required for a write is shortened, since the HDD 16 stripe in which the compressed parity is written is not accessed each time a data write is carried out in response to receiving a write request. This will be explained using HDD 16-1 as an example. Compressed parity "RP0", which corresponds to HDD 16-1, is created based on the data elements and P parity (for example, data elements "0", "3", and "6" and P parity "P3") written in the normal stripes of HDD 16-1, and this is written to the cache area and the HDD 16-1. The creation and writing of the compressed parities "RP1" through "RP3" in the other HDD 16-2 through 16-4 are the same as for the compressed parity "RP0" in HDD 16-1.

With the exception of a compressed parity being written, the first data protection mode is substantially the same as the data protection mode of RAID 5. Therefore, when there are a total of four HDD 16 configuring the RAID group as in this figure, three data elements (for example, data elements "0", "1" and "2"), which configure a data unit, are written to three of these HDD 16, and one P parity (for example, P parity "P0"), which is based on this data unit, is written to the one remaining HDD 16. That is, if it is supposed that the number of HDD 16 configuring the RAID group is N (N being an integer of no less than 3), one data element is written to each of (N−1) HDD 16, that is, a total of (N−1) data elements is written to (N−1) HDD 16, and a single P parity, which is created on the basis of these (N−1) data elements, is written to the one remaining HDD 16. The P parity is distributively written to the four HDD 16-1 through 16-4. In other words, the HDD 16, which constitutes the write destination of the P parity, shifts for each data unit.
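
The shifting write destination of the P parity can be sketched as follows. The patent only states that the destination shifts for each data unit, so the concrete rotation formula below is an assumption chosen to match the layout described for FIG. 10 (P parity "P0" on HDD 16-4, "P3" on HDD 16-1).

    def parity_hdd(row, n_hdds=4):
        """0-based index of the HDD holding the P parity of a given row of stripes."""
        return (n_hdds - 1) - (row % n_hdds)

    def data_hdds(row, n_hdds=4):
        """0-based indices of the HDD holding that row's data elements."""
        p = parity_hdd(row, n_hdds)
        return [i for i in range(n_hdds) if i != p]

    # parity_hdd(0) == 3  -> P0 on HDD 16-4
    # parity_hdd(3) == 0  -> P3 on HDD 16-1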

Next, the updating of the data unit and P parity will be explained in accordance with the first data protection mode. Furthermore, in the following explanation, a pre-update data element will be referred to as an "old data element", and a post-update data element will be referred to as a "new data element" (the same will hold true for a P parity and a compressed parity).

For example, it is supposed that data element "1" of the three data elements configuring the first data unit is updated. In this case, P parity "P0" and compressed parity "RP1" must be updated. This is because both the P parity "P0" and the compressed parity "RP1" were created based on the old data element "1", and if data element "1" is updated, the values of P parity "P0" and compressed parity "RP1" will change.

Accordingly, the disk I/O processor 202 first reads out the old data element "1", which is required to create P parity "P0", and the old P parity "P0" from HDD 16-2 and 16-4. Then, the disk I/O processor 202 can create a new P parity "P0" based on the new data element "1", the old data element "1" and the old P parity "P0" (or it can create a new P parity "P0" from data elements "0" and "2", which have not been updated, and the new data element "1"). Further, the disk I/O processor 202 also carries out the creation of a new compressed parity "RP1" using the same kind of method as that for the creation of the new P parity "P0". That is, the disk I/O processor 202 can create a new compressed parity "RP1" based on the old compressed parity "RP1" corresponding to HDD 16-2, the old data element "1" and the new data element "1" (or the disk I/O processor 202 can read out from HDD 16-2 a data element other than the old data element "1" and the P parity, and create a new compressed parity "RP1" based on the read-out data element and P parity, and the new data element "1"). Thereafter, the disk I/O processor 202 writes the new data element "1" to the normal stripe in which the old data element "1" is stored, and writes the new P parity "P0" to the normal stripe in which the old P parity "P0" is stored. Further, the disk I/O processor 202 writes the new compressed parity "RP1" to the cache area, and at a prescribed timing, writes the new compressed parity "RP1" in the cache area to the specified stripe in which the old compressed parity "RP1" is stored.
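
The update described above follows the XOR identity "new parity = old parity XOR old data element XOR new data element", so the whole row does not have to be re-read. The following sketch uses dummy byte values and an assumed helper function; it is not the actual implementation of the parity create module.

    def bxor(*blocks):
        """Bytewise XOR of equally sized byte strings."""
        out = bytearray(len(blocks[0]))
        for block in blocks:
            for i, b in enumerate(block):
                out[i] ^= b
        return bytes(out)

    old_d1, new_d1 = b"\x11" * 8, b"\x5a" * 8
    old_p0 = b"\x77" * 8    # old P parity of the row of stripes containing d1
    old_rp1 = b"\x0c" * 8   # old compressed parity of the HDD storing d1

    new_p0 = bxor(old_p0, old_d1, new_d1)    # updated P parity "P0"
    new_rp1 = bxor(old_rp1, old_d1, new_d1)  # updated compressed parity "RP1"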

FIG. 11 is a schematic diagram of a restore in the first data protection mode. Furthermore, in the following explanation, a data unit in which there is only one data element that cannot be read out, that is, a data unit which is not double-failure data (as explained hereinabove, a data unit comprising two unreadable data elements), will be referred to as "single-failure data".

In this figure, HDD 16-1 is a completely failed HDD, and HDD 16-2 is a partially failed HDD. In this partially failed HDD 16-2, the normal stripe, in which data element "13" is stored, is the failed stripe.

The restore procedure for a data element, which constitutes single-failure data, is the same as in normal RAID 5. That is, the data element constituting the single-failure data, which is stored in the failed stripe of completely failed HDD 16-1, is restored using all of the other data elements constituting this single-failure data, and the P parity corresponding to this single-failure data (in other words, using the data elements and P parity, which are in all the other stripes in the row of stripes comprising this failed stripe).

Conversely, two unreadable data elements, which constitute double-failure data, cannot be restored using the same procedure as that of a normal RAID 5. This is because there is only one parity corresponding to the double-failure data. In this example, the two data elements "12" and "13" cannot be restored using only the remaining data element "14" and the P parity "P4".

Accordingly, the restoration of these two data elements is carried out using the compressed parity as follows.

First, data element "13", which is stored in the failed stripe of the partially failed HDD 16-2, is restored. More specifically, data element "13" is restored in accordance with the exclusive OR of the data elements and P parity stored in all the normal stripes other than the failed stripe in the partially failed HDD 16-2, and the compressed parity "RP1" corresponding to this partially failed HDD 16-2 (the compressed parity "RP1" stored in either the cache area or the specified stripe) ((a) of this figure).

Next, data element "12", which is stored in the failed stripe of the completely failed HDD 16-1, is restored based on this data element "13", the other data element "14" in the data unit comprising this data element "13", and the P parity "P4" corresponding to this data unit. Consequently, both data elements "12" and "13" in the double-failure data are restored.

Lastly, the compressed parity, which is stored in the specified stripe of the completely failed HDD 16-1, is restored. More specifically, the compressed parity "RP0" is restored on the basis of the restored data element "12" of the double-failure data, which is stored in the failed stripe of the completely failed HDD 16-1, and the other data elements and P parity, which were restored the same as in RAID 5.
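
The restore sequence of FIG. 11 can be sketched as follows, again assuming bytewise XOR parity and using dummy stripe contents; for brevity only two of the other normal stripes of HDD 16-2 are shown.

    def bxor(*blocks):
        out = bytearray(len(blocks[0]))
        for block in blocks:
            for i, b in enumerate(block):
                out[i] ^= b
        return bytes(out)

    # Row of stripes holding the double-failure data: elements 12, 13, 14 and P4.
    d12, d13, d14 = b"\x12" * 4, b"\x13" * 4, b"\x14" * 4
    p4 = bxor(d12, d13, d14)

    # Compressed parity RP1 of HDD 16-2 covers everything in its normal stripes.
    other_hdd2_stripes = [b"\x01" * 4, b"\x0d" * 4]
    rp1 = bxor(d13, *other_hdd2_stripes)

    # HDD 16-1 fails completely (d12 lost) and the stripe holding d13 on
    # HDD 16-2 fails as well.

    # (a) Restore d13 from RP1 and the readable normal stripes of HDD 16-2.
    restored_d13 = bxor(rp1, *other_hdd2_stripes)
    assert restored_d13 == d13

    # (b) Restore d12 from the restored d13, the remaining element d14 and P4.
    restored_d12 = bxor(restored_d13, d14, p4)
    assert restored_d12 == d12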

The preceding is an explanation for the first data protection mode.

Furthermore, in the above-described processing, when a P parity instead of a data element is stored in the failed stripe of the partially failed HDD, the P parity stored in this failed stripe is restored based on the compressed parity corresponding to this partially failed HDD, and the data elements and P parity stored in the normal stripes other than the failed stripe of this partially failed HDD. Then, using this restored P parity, the data element stored in the failed stripe of the completely failed HDD is restored using the same method as that of a normal RAID 5.

Further, according to the above explanation, in the first data protection mode, all HDD other than the completely failed HDD can be partially failed HDD. Moreover, restoring a data element stored in a partially failed HDD requires that only one failed stripe exist in that partially failed HDD.

Further, in the first data protection mode, the total size of one data unit and the P parity corresponding thereto is the same as the size of one row of stripes. More specifically, one data unit and the P parity corresponding thereto fit in one row of stripes, and do not span a plurality of rows of stripes. However, the present invention is not limited to this, and, for example, the total size of one data unit and the P parity corresponding thereto can be smaller than the size of one row of stripes.

FIG. 12 is a schematic diagram of a write in the second data protection mode.

The RAID configuration of the second data protection mode will be explained first.

In the second data protection mode, as shown in FIG. 12, the HDD 16 are logically arranged two-dimensionally, and respectively grouped in the horizontal and vertical directions to configure RAID groups. In this embodiment, a horizontal-direction RAID group will be called a "horizontal RAID group", a vertical-direction RAID group will be called a "vertical RAID group", and the RAID group configured from all the horizontal RAID groups and all the vertical RAID groups will be called the "total RAID group". Further, a parity created inside a horizontal RAID group will be called a "horizontal parity", and a parity created inside a vertical RAID group will be called a "vertical parity".

According to FIG. 12, the respective HDD 16 configuring the VDEV are members of a certain one horizontal RAID group, and are also members of a certain one vertical RAID group.

FIG. 12, for example, configures RAID 5 in connection with the horizontal RAID groups, and configures RAID 4 in connection with the vertical RAID groups (that is, the horizontal RAID groups correspond to RAID 5, and the vertical RAID groups correspond to RAID 4). Here, RAID 4 is similar to RAID 5, but differs from RAID 5 in that a redundancy code is allocated to one prescribed HDD instead of being distributively allocated to a plurality of HDD as in RAID 5. The RAID configuration will be explained in more detail. In the example of FIG. 12, for example, RAID 5-based horizontal RAID groups i are configured from HDD 16-i-0, 16-i-1 and 16-i-2 (i=0, 1, 2, 3, 4), and RAID 4-based vertical RAID groups j are configured from HDD 16-0-j, 16-1-j, 16-2-j, 16-3-j and 16-4-j (j=0, 1, 2).

Next, the manner in which the data elements, horizontal parity and vertical parity are arranged in the HDD 16 in the second data protection mode will be explained.

The horizontal parity inside a RAID 5 horizontal RAID group is distributively allocated to a plurality of HDD 16 inside this horizontal RAID group. Conversely, the vertical parity inside a RAID 4 vertical RAID group j is allocated to one prescribed HDD 16 (HDD 16-4-j in the example of FIG. 12).

In this figure, the cells configuring the HDD 16 represent the stripes. An identifier (0, 1, 2, . . . ) for uniquely identifying the stripes inside the respective HDD 16 is provided in the left half of a cell. An identifier (HP0, HP1, HP2, . . . ) for uniquely identifying a horizontal parity inside a horizontal RAID group corresponding to a relevant cell, or an identifier (VP0, VP1, VP2, . . . ) for uniquely identifying a vertical parity inside a vertical RAID group corresponding to a relevant cell is provided in the right half of a cell. A cell in which the right half of the cell is blank shows a cell in which a data element is stored.

In the example of this figure, for example, the horizontal parity (horizontal parity HP0) inside the horizontal RAID group 0 corresponding to stripe 0 inside HDD 16-0-0 and 16-0-1 is stored in the stripe 0 inside HDD 16-0-2. Further, the vertical parity (vertical parity VP0) inside the vertical RAID group 0 corresponding to stripe 0 of HDD 16-0-0, 16-1-0, 16-2-0 and 16-3-0 is stored in stripe 0 inside HDD 16-4-0. Hereinafter, a group of a plurality of stripes in the same location (having the same identifier), from among the stripes inside a plurality of HDD 16 inside a certain vertical RAID group, will be called a “row of vertical stripes”. By contrast, a group of horizontal-direction stripes inside a horizontal RAID group will be called a “row of horizontal stripes”. A plurality of data elements and a vertical parity, which is created from this plurality of data elements, are stored in one row of vertical stripes.
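
As an illustrative sketch only (not part of the embodiment), the following Python fragment shows how the XOR-based horizontal parity HP0 and vertical parity VP0 described above could be computed; the stripe contents are hypothetical byte strings and the HDD names merely follow the numbering of FIG. 12.

```python
def xor_blocks(blocks):
    """XOR a list of equal-length byte strings together."""
    out = bytearray(len(blocks[0]))
    for block in blocks:
        for i, b in enumerate(block):
            out[i] ^= b
    return bytes(out)

# Hypothetical stripe-0 contents (assumed values for illustration).
stripe0 = {
    "16-0-0": b"\x11" * 8,
    "16-0-1": b"\x22" * 8,
    "16-1-0": b"\x33" * 8,
    "16-2-0": b"\x44" * 8,
    "16-3-0": b"\x55" * 8,
}

# Horizontal parity HP0 (stored in stripe 0 of HDD 16-0-2): XOR across
# the data elements of horizontal RAID group 0.
hp0 = xor_blocks([stripe0["16-0-0"], stripe0["16-0-1"]])

# Vertical parity VP0 (stored in stripe 0 of HDD 16-4-0): XOR down the
# data elements of vertical RAID group 0.
vp0 = xor_blocks([stripe0["16-0-0"], stripe0["16-1-0"],
                  stripe0["16-2-0"], stripe0["16-3-0"]])
```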

Furthermore, horizontal parities are stored in stripe 2 of HDD 16-0-0, 16-1-0, 16-2-0 and 16-3-0, but vertical parity VP2, which could be created from these horizontal parities, is not created (although it could be). This is because each time the value of a data element is updated, the horizontal parity is updated, and if a vertical parity were to be created from the horizontal parities, the vertical parity corresponding to the horizontal parities would also have to be updated in line with updating the value of the data element, resulting in numerous accesses to the HDD 16. Further, double-failure data can still be restored even if a vertical parity corresponding to the horizontal parities is not created.

Next, data configured in the Disk Group configuration table 700 in the second data protection mode will be explained in detail.

In FIG. 7, Disks 16, 17 and 18 of Disk Group 0 of VDEV 2 respectively correspond to HDD 16-0-0, 16-0-1 and 16-0-2 of FIG. 12. Since these Disks configure a horizontal RAID group, which has a RAID level of RAID 5, “RAID 5” is configured in FIG. 7 in RAID level column 704 corresponding to Disk Group 0 of VDEV 2.

Further, in FIG. 7, Disks 16, 19, 22, 25 and 28 of VDEV 2 Disk Group 5 correspond to HDD 16-0-0, 16-1-0, 16-2-0, 16-3-0 and 16-4-0 of FIG. 12. Because these Disks configure a vertical RAID group, the RAID level of which is RAID 4, “RAID 4” is configured in the RAID level column 704 corresponding to VDEV 2 Disk Group 5 in FIG. 7.

Next, the storage capacity consumed in the second data protection mode will be explained.

First, the “data capacity ratio” of a certain RAID group will be defined as follows. That is, “RAID group data capacity ratio”=“total amount of data other than redundancy codes comprised in the RAID group”/“total capacity of all HDD configuring the RAID group”. This means that the larger the data capacity ratio, the more it is possible to conserve the amount of HDD storage capacity consumed.

Here, for example, let A=“total capacity of all HDD configuring the RAID group”, B=“total amount of data other than redundancy codes comprised in the RAID group”, and C=“RAID group data capacity ratio”=B/A. In addition, when the storage capacity of all the HDD 16 is the same, and A0 and B0 are the numbers of HDD when the C of RAID 6 is C0, the relationships A0=N+2, B0=N, and C0=B0/A0=N/(N+2) are achieved (refer to FIG. 32A, in which N is the number of HDD in which data elements are stored, and the 2 of N+2 is the number of HDD in which parities are stored. Furthermore, FIG. 32 is convenient for use in computing the capacities of A and B, and the redundancy codes are actually distributively stored in a plurality of HDD).

As a simple example of this, think of both the horizontal RAID groups and the vertical RAID groups as RAID groups that correspond to RAID 4 (refer to FIG. 32B). In this case, the HDD, for example, can be classified into four types of HDD: (1) HDD in which are stored data elements; (2) HDD in which are stored horizontal parities; (3) HDD in which are stored vertical parities; and (4) HDD in which are stored the vertical parity of horizontal parities (or the horizontal parity of vertical parities). It is possible to restore a data set when a double failure occurs inside a horizontal RAID group even without the HDD of (4). Therefore, a determination must be made as to whether or not to include the HDD of (4) in the “total capacity of all HDD configuring the RAID group”.

In a situation (Case 1) in which the HDD of (4) is included in the “total capacity of all the HDD configuring the RAID group”, the following relational expressions are achieved:

A1=(N+1)(K+1)

B1=N·K

C1=B1/A1=N·K/((N+1)(K+1))

A1, B1 and C1 are, respectively, A, B and C in Case 1. In Case 1, the data capacity efficiency is higher than that of RAID 6, that is, C1>C0, if the expression below is satisfied.

N·K/((N+1)(K+1))>N/(N+2) . . . (W)

If (W) is modified, it becomes

K>N+1 . . . (W)′

Conversely, in a situation (Case 2) in which the HDD of (4) is not included in the “total capacity of all the HDD configuring the RAID group”, the following relational expressions are achieved:

A2=(N+1)(K+1)−1

B2=N·K

C2=B2/A2=N·K/((N+1)(K+1)−1)

A2, B2 and C2 are, respectively, A, B and C in Case 2. In Case 2, the data capacity efficiency is higher than that of RAID 6, that is, C2>C0, if the expression below is satisfied.

N·K/((N+1)(K+1)−1)>N/(N+2) . . . (w)

If (w) is modified, it becomes

K>N . . . (w)′

In Case 2, configuration-wise, it is considered impossible unless both the horizontal RAID groups and the vertical RAID groups correspond to RAID 4. Therefore, either when one of the horizontal RAID groups or the vertical RAID groups is RAID 5 and the other is RAID 4, or when both the horizontal RAID groups and the vertical RAID groups are RAID 5, the same thinking as in Case 1 should apply.

Therefore, in the second data protection mode (and the third data protection mode described hereinbelow), it is probably better to use the Case 1 concept of data capacity efficiency.
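
The data capacity ratios above can be checked numerically; the small Python sketch below (an illustration only, with example values of N and K that are assumptions) verifies that condition (W)′ decides C1>C0 and condition (w)′ decides C2>C0.

```python
from fractions import Fraction

def c0(n):                     # RAID 6: N data HDD plus 2 parity HDD
    return Fraction(n, n + 2)

def c1(n, k):                  # Case 1: all (N+1)(K+1) HDD are counted
    return Fraction(n * k, (n + 1) * (k + 1))

def c2(n, k):                  # Case 2: the HDD of (4) is excluded
    return Fraction(n * k, (n + 1) * (k + 1) - 1)

n, k = 3, 5                    # example values (assumptions)
assert (c1(n, k) > c0(n)) == (k > n + 1)   # condition (W)'
assert (c2(n, k) > c0(n)) == (k > n)       # condition (w)'
```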

Next, the updating of a data unit, horizontal parity and vertical parity in accordance with the second data protection mode will be explained. Furthermore, in the following explanation, a pre-update data element will be referred to as an “old data element” and a post-update data element will be referred to as a “new data element” (the same will hold true for a horizontal parity and a vertical parity as well).

For example, take a case in which the stripe 0 inside HDD 16-0-0 is to be updated. For the sake of explanation, the data inside stripe 0 of HDD 16-0-0, 16-0-2 and 16-4-0, respectively, will be called data element “0”, horizontal parity “HP0” and vertical parity “VP0”. When updating data element “0”, it is necessary to update horizontal parity “HP0” and vertical parity “VP0”. This is because the values of the horizontal parity “HP0” and the vertical parity “VP0” are each dependent on data element “0”, and if the value of data element “0” is updated, the values of the horizontal parity “HP0” and the vertical parity “VP0” will change.

Accordingly, the disk I/O processor 202 first reads from HDD 16-0-0 and 16-0-2 the old data element “0” and the old horizontal parity “HP0”, which are required to create a new horizontal parity “HP0”. Then the disk I/O processor 202 can create the new horizontal parity “HP0” on the basis of the new data element “0”, the old data element “0” and the old horizontal parity “HP0” (or, can create a new horizontal parity “HP0” from the data element inside stripe 0 of HDD 16-0-1, which is the data element that has not been updated, and the new data element “0”). Further, the disk I/O processor 202 also carries out the creation of a new vertical parity “VP0” using the same method as that for creating the new horizontal parity “HP0”. That is, the disk I/O processor 202 can create the new vertical parity “VP0” on the basis of the old vertical parity “VP0”, the old data element “0” and the new data element “0” (or, the disk I/O processor 202 can read out data from stripes 0 of HDD 16-1-0, 16-2-0 and 16-3-0, and create a new vertical parity “VP0” on the basis of the read-out data and the new data element “0”). Thereafter, the disk I/O processor 202 writes the new data element “0” to the stripe in which the old data element “0” is being stored, writes the new horizontal parity “HP0” to the stripe in which the old horizontal parity “HP0” is being stored, and writes the new vertical parity “VP0” to the stripe in which the old vertical parity “VP0” is being stored.
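
A minimal sketch of this read-modify-write update, assuming the parities are simple XOR codes and using hypothetical stripe contents, is shown below; the new horizontal parity and new vertical parity are each derived from the old parity, the old data element and the new data element.

```python
def xor(a, b):
    return bytes(x ^ y for x, y in zip(a, b))

def update_parity(old_parity, old_element, new_element):
    # XORing out the old element's contribution and XORing in the new
    # element's contribution yields the new parity.
    return xor(xor(old_parity, old_element), new_element)

# Hypothetical old contents of stripe 0 of HDD 16-0-0, 16-0-2 and 16-4-0.
old_d0, old_hp0, old_vp0 = b"\x10" * 8, b"\x32" * 8, b"\x76" * 8
new_d0 = b"\x0f" * 8           # new data element "0"

new_hp0 = update_parity(old_hp0, old_d0, new_d0)
new_vp0 = update_parity(old_vp0, old_d0, new_d0)
# The new data element and both new parities are then written back to the
# stripes holding the corresponding old values.
```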

In the second data protection mode, since a vertical parity is updated no matter what data element is updated, read/write access to the HDD 16 storing the vertical parity occurs. For this reason, there are times when the HDD 16 storing the vertical parity becomes an I/O performance bottleneck for the storage system 1. Accordingly, in order to do away with this performance bottleneck, the data unit update and the vertical parity update can be carried out asynchronously. More specifically, first, when updating the data unit, a bit corresponding to the relevant data unit in a table stored in the cache/control memory 14, which records whether or not a data unit has been updated, is set to ON. Vertical parity updating is not carried out at this time. Next, asynchronously to the updating of the data unit (for example, at fixed times, or when the I/O load on the storage system 1 is low), the data unit required for vertical parity updating (a data unit for which the bit in the above-mentioned table is ON) is selected by referring to the above-mentioned table, and a new vertical parity is created based on the data elements of the row of vertical stripes corresponding to the vertical parity that corresponds to this data unit, thereby updating the vertical parity.
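
One possible shape of this asynchronous update, sketched below with illustrative names only, is to record a dirty bit per row of vertical stripes at write time and to recompute the vertical parities of the dirty rows in a later background pass.

```python
dirty = {}   # row identifier -> True when its vertical parity is stale

def write_data_unit(row_id, new_elements, hdds):
    """Update data elements (and horizontal parity) and defer the vertical parity."""
    for hdd_name, element in new_elements.items():
        hdds[hdd_name][row_id] = element
    dirty[row_id] = True                 # vertical parity update is postponed

def flush_vertical_parities(hdds, data_hdd_names, parity_hdd_name):
    """Run at fixed times, or when the I/O load on the storage system is low."""
    for row_id in [r for r, stale in dirty.items() if stale]:
        length = len(hdds[data_hdd_names[0]][row_id])
        vp = bytes(length)
        for hdd_name in data_hdd_names:
            vp = bytes(a ^ b for a, b in zip(vp, hdds[hdd_name][row_id]))
        hdds[parity_hdd_name][row_id] = vp
        dirty[row_id] = False
```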

Next, the mapping of a logical address and a physical address in the second data protection mode will be explained. In the second data protection mode, for example, it is possible to map a logical address and a physical address such that consecutive logical addresses extend across a plurality of horizontal RAID groups. In other words, for example, two or more logical addresses, which respectively correspond to two or more stripes configuring a Pth row of stripes inside an Xth horizontal RAID group, and two or more logical addresses, which respectively correspond to two or more stripes configuring a Pth row of stripes inside an X+1st horizontal RAID group, are consecutive (where X and P are integers greater than 0).

In the example of FIG. 12, for example, the logical addresses and physical addresses are mapped such that stripes 0 of HDD 16-0-0, 16-0-1, 16-1-0, 16-1-1, 16-2-0, 16-2-1, 16-3-0 and 16-3-1 constitute consecutive logical addresses. When writing lengthy data having consecutive logical addresses, carrying out mapping like this increases the probability of being able to create a vertical parity by determining the exclusive OR of data elements inside this lengthy data. In the example of FIG. 12 above, it is possible to create a vertical parity VP0 inside HDD 16-4-0 by computing the exclusive OR of stripes 0 of HDD 16-0-0, 16-1-0, 16-2-0 and 16-3-0. Creating a vertical parity like this does away with the need to read out the old vertical parity from the HDD 16 when writing lengthy data, thereby making it possible to enhance write efficiency.
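
A sketch of one such mapping is given below; the constants describe the shape of FIG. 12 (two data stripes per row in each horizontal RAID group, four horizontal RAID groups holding data) and are assumptions made for illustration.

```python
DATA_STRIPES_PER_GROUP = 2    # data stripes per row inside one horizontal RAID group
NUM_HORIZONTAL_GROUPS = 4     # horizontal RAID groups holding data elements

def logical_to_physical(logical_stripe):
    """Map a logical stripe number to (horizontal RAID group, HDD in group, stripe)."""
    per_row = DATA_STRIPES_PER_GROUP * NUM_HORIZONTAL_GROUPS
    row = logical_stripe // per_row
    offset = logical_stripe % per_row
    group = offset // DATA_STRIPES_PER_GROUP
    hdd_in_group = offset % DATA_STRIPES_PER_GROUP
    return group, hdd_in_group, row

# Logical stripes 0..7 land on stripe 0 of HDD 16-0-0, 16-0-1, 16-1-0,
# 16-1-1, ..., 16-3-0, 16-3-1, matching the order given above.
assert [logical_to_physical(i) for i in range(4)] == [
    (0, 0, 0), (0, 1, 0), (1, 0, 0), (1, 1, 0)]
```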

FIG. 13 is a schematic diagram of a restore in the second data protection mode. Even in the second data protection mode, it is possible to restore one data element that cannot be read out from single-failure data, and two data elements that cannot be read out from double-failure data.

In FIG. 13, HDD 16-0-0 is a completely failed HDD, and HDD 16-0-1 is a partially failed HDD. In this partially failed HDD 16-0-1, stripes 0 through 3 are failed stripes.

The procedure for restoring a data element configuring single-failure data is the same as that in a normal RAID 5. That is, the data element configuring the single-failure, which is stored in the failed stripe of completely failed HDD 16-0-0, is restored using all the other data elements configuring this single-failure data, and the horizontal parity corresponding to this single-failure data (in other words, using the data elements and horizontal parities of all the other stripes in the row of stripes comprising this failed stripe).

Conversely, two unreadable data elements configuring double-failure data cannot be restored with the same procedure as that of a normal RAID 5.

Accordingly, the restore of these two data elements is carried out as follows using the vertical parity. That is, these two data elements are restored via the same method as that of RAID 4 using the data elements and vertical parities that are in all the other stripes in the (previously defined) row of vertical stripes comprising the failed stripe of the unreadable data element. In FIG. 13, for example, the data element corresponding to stripe 0 inside HDD 16-0-0 is restored from the data elements and vertical parity corresponding to stripes 0 of HDD 16-1-0, 16-2-0, 16-3-0, and 16-4-0. By so doing, it is possible to restore both of the two unreadable data elements configuring the double-failure data using the vertical parity. Furthermore, after using vertical parity to restore one of the two data elements configuring the double-failure data, the relevant double-failure data becomes single-failure data, thereby making it possible for the subsequent restore procedure to be the same as the single-failure data restore procedure.
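
The restore of one such data element can be pictured with the XOR sketch below (illustrative values only): XORing the readable stripes of the row of vertical stripes, exactly as in a RAID 4 rebuild, reproduces the lost stripe of the completely failed HDD 16-0-0.

```python
def xor_blocks(blocks):
    out = bytearray(len(blocks[0]))
    for block in blocks:
        for i, b in enumerate(block):
            out[i] ^= b
    return bytes(out)

# Readable stripes 0 of vertical RAID group 0: the data elements of HDD
# 16-1-0, 16-2-0, 16-3-0 and the vertical parity VP0 of HDD 16-4-0
# (hypothetical contents).
surviving_stripes = [b"\x33" * 8, b"\x44" * 8, b"\x55" * 8, b"\x22" * 8]

# The XOR of the surviving stripes is the lost data element; once it is
# back, the row holds single-failure data and the remaining element of
# HDD 16-0-1 can be restored with the horizontal parity.
restored_element = xor_blocks(surviving_stripes)
```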

Single-failure data and double-failure data can be restored as described hereinabove.

Furthermore, in the above explanation, it is supposed that HDD 16-0-1 is the partially failed HDD, but the restore can be carried out the same way even when HDD 16-0-1 is the completely failed HDD. Further, in the above explanation, it is supposed that HDD 16-0-0 is the completely failed HDD, but the restore can be carried out the same way even when HDD 16-0-0 is the partially failed HDD.

Further, in the above explanation, a situation in which an HDD 16 configuring a horizontal RAID group constitutes a completely failed HDD was explained, but when the HDD, which is storing the vertical parities of a vertical RAID group (for example, HDD 16-4-0 of FIG. 13), is a completely failed HDD, a vertical parity is created in accordance with the parity creation method of RAID 4 from the other HDD of the relevant vertical RAID group, thereby restoring the completely failed HDD.

The preceding is an explanation of the second data protection mode.

FIG. 14 is a schematic diagram of a write in the third data protection mode.

The third data protection mode is a format that is similar to that of the second data protection mode. Hereinbelow, the third data protection mode will be explained by focusing on the points of similarity and points of difference between the third data protection mode and the second data protection mode.

First, the RAID configuration in the third data protection mode will be explained.

In the third data protection mode, horizontal RAID groups and vertical RAID groups are configured the same as in the second data protection mode, but unlike the second data protection mode, RAID 5 rather than RAID 4 is configured for the vertical RAID groups. The RAID group configured from all the horizontal RAID groups and all the vertical RAID groups will be called the total RAID group, the same as in the second data protection mode. The RAID configuration will be explained in detail. In the example of FIG. 14, for example, RAID 5-based horizontal RAID groups i are configured from HDD 16-i-0, 16-i-1, 16-i-2 (i=0, 1, 2, 3, 4), and RAID 5-based vertical RAID groups j are configured from HDD 16-0-j, 16-1-j, 16-2-j, 16-3-j, 16-4-j (j=0, 1, 2).

Next, the manner in which the data elements, horizontal parity and vertical parity are arranged in the HDD 16 in the third data protection mode will be explained.

The horizontal parities inside a RAID 5 horizontal RAID group are distributively allocated to a plurality of HDD 16 inside the relevant horizontal RAID group. This is the same as the second data protection mode. The third data protection mode differs from the second data protection mode in that the vertical parities inside a RAID 5 vertical RAID group j are distributively allocated to a plurality of HDD 16 inside the relevant vertical RAID group. In other words, the HDD 16 that constitutes the allocation destination of a vertical parity shifts in each row of vertical stripes. Furthermore, the definition of a row of vertical stripes as used here is the same as in the second data protection mode.

In FIG. 14, the meaning of a cell, which configures an HDD 16, is the same as in the second data protection mode.

In the example of FIG. 14, for example, the horizontal parity of the horizontal RAID group 0 corresponding to the stripes 0 inside HDD 16-0-0 and 16-0-1 is stored in the stripe 0 (horizontal parity HP0) inside HDD 16-0-2. Further, the vertical parity of the vertical RAID group 0 corresponding to the stripes 0 inside HDD 16-0-0, 16-1-0, 16-2-0 and 16-3-0 is stored in the stripe 0 (vertical parity VP0) inside HDD 16-4-0.

Next, in the third data protection mode, since the specific example of the data configured in the Disk Group configuration table 700 is practically the same as that explained for the second data protection mode, a detailed explanation thereof will be omitted. Unlike in the second data protection mode, since the RAID level of the vertical RAID group is RAID 5 in the third data protection mode, “RAID 5” instead of “RAID 4” is configured in the element of column 704 in which the RAID level of the vertical RAID group is configured.

Next, the mapping of a logical address and a physical address in the third data protection mode will be explained. In the third data protection mode, too, logical addresses and physical addresses can be mapped such that consecutive logical addresses span a plurality of horizontal RAID groups, the same as in the second data protection mode. The specific example of this is practically the same as that explained for the second data protection mode. Carrying out mapping like this can increase write performance by doing away with the need to read out the old vertical parity from the HDD 16 when writing lengthy data, the same as in the second data protection mode.

Next, since the method of updating a data unit, horizontal parity and vertical parity in accordance with the third data protection mode is the same as that in the second data protection mode, with the exception of the fact that vertical parities are distributively allocated to a plurality of HDD 16 inside a vertical RAID group, the explanation will be omitted.

FIG. 15 is a schematic diagram of a restore in the third data protection mode.

In the third data protection mode as well, one unreadable data element in single-failure data and two unreadable data elements in double-failure data can be restored the same as in the second data protection mode.

In FIG. 15, HDD 16-0-0 is a completely failed HDD, and HDD 16-0-1 is a partially failed HDD. In this partially failed HDD 16-0-1, stripes 0 through 3 are failed stripes.

Since the procedure for restoring the data elements configuring the single-failure data is the same as that of a normal RAID 5 and is the same as the procedure explained for the second data protection mode, this explanation will be omitted.

Conversely, two unreadable data elements configuring double-failure data cannot be restored using the same procedures as a normal RAID 5.

Accordingly, the restore of these two data elements is carried out using the vertical parity, the same as in the second data protection mode, but since a vertical RAID group in the third data protection mode is configured in accordance with RAID 5, unlike in the second data protection mode, the restore is carried out via the same method as RAID 5 using the data elements and vertical parities in all the other stripes of the (previously defined) row of vertical stripes comprising the failed stripe of the unreadable data element. In FIG. 15, for example, the data element corresponding to the stripe 0 inside HDD 16-0-0 is restored based on the data elements and vertical parity corresponding to the stripes 0 inside HDD 16-1-0, 16-2-0, 16-3-0 and 16-4-0. In so doing, it is possible to restore both of the two unreadable data elements configuring the double-failure data using the vertical parity. Furthermore, after using vertical parity to restore one of the two data elements configuring the double-failure data, the relevant double-failure data becomes single-failure data, thereby making it possible for the subsequent restore procedure to be the same as the single-failure data restore procedure.

Single-failure data and double-failure data can be restored as described hereinabove.

Furthermore, in the above explanation, it is supposed that HDD 16-0-1 is the partially failed HDD, but the restore can be carried out the same way even when HDD 16-0-1 is the completely failed HDD. Further, in the above explanation, it is supposed that HDD 16-0-0 is the completely failed HDD, but the restore can be carried out the same way even when HDD 16-0-0 is the partially failed HDD. Further, when the vertical parity constitutes the failed stripe, the relevant vertical parity can be restored in accordance with the RAID 5 restore method based on the other data elements of the row of vertical stripes comprising the relevant vertical parity.

The preceding is an explanation of the third data protection mode.

FIG. 16 is a schematic diagram of a write in the fourth data protection mode.

First, the RAID configuration in the fourth data protection mode will be explained.

In the fourth data protection mode, as shown in FIG. 16, a RAID group is configured by grouping together a plurality of HDD 16 (to expedite the explanation, this RAID group will be called a “partial RAID group”), and these RAID groups are further grouped together to configure a RAID group (to expedite the explanation, this RAID group will be called the “total RAID group”). A VDEV is configured in accordance with the total RAID group. In FIG. 16, for example, partial RAID group 0 is configured by HDD 16-0-0, 16-0-1, 16-0-2 and 16-0-3, and the total RAID group is configured from partial RAID groups 0, 1 and 2.

Next, the manner in which the data elements and redundancy data are arranged in the HDD 16 in the fourth data protection mode will be explained. In the fourth data protection mode, a data unit is stored in a row of stripes of a partial RAID group. In the fourth data protection mode, unlike with the data protection modes explained thus far, a Q parity (a second redundancy code) as well as a P parity is created, the same as in RAID 6. However, whereas the Q parity is written to a partial RAID group for each data unit, P parity is not written for each data unit. With regard to P parity, a plurality of P parities, which correspond to data units inside a plurality of partial RAID groups, is compressed into a single compressed parity, and written to the total RAID group. Consequently, it is possible to conserve the consumed storage capacity of the total RAID group while being able to restore two data elements in double-failure data.

Prior to explaining the compressed parity in the fourth data protection mode, the cells configuring the HDD 16 in FIG. 16 will be explained. The cells inside the HDD 16 (for example, HDD 16-0-0) of FIG. 16 represent stripes. Inside the respective cells is written either an identifier for uniquely identifying a data element inside the HDD 16 to which the stripe corresponding to the relevant cell belongs, or an identifier for uniquely identifying the Q parity inside the partial RAID group to which the stripe corresponding to the relevant cell belongs, or an identifier for uniquely identifying the compressed parity inside the total RAID group to which the stripe corresponding to the relevant cell belongs. A data element is written in a cell in which a numeral (1, 2, 3, . . . ) is written inside the cell. A Q parity, which is created from the plurality of data elements corresponding to identifier i inside the partial RAID group to which the relevant stripe belongs, is written in the stripe corresponding to a cell in which “Qi” (i=1, 2, 3, . . . ) is written inside the cell. A compressed parity, which is created from the plurality of data elements corresponding to identifier i inside the total RAID group to which the relevant stripe belongs, is written in the stripe corresponding to a cell in which “CPi” (i=1, 2, 3, . . . ) is written inside the cell.

In the fourth data protection mode, the compressed parity written to the specified stripe is a code that compresses a plurality of P parities corresponding to a plurality of data units into one parity; more specifically, it is created by computing the exclusive OR of this plurality of P parities. In this embodiment, for example, one compressed parity is created on the basis of two P parities, which correspond to two data units. More specifically, for example, the compressed parity “CP1” is created by computing the exclusive OR of a first P parity corresponding to a first data unit (“P1-0”, which is the exclusive OR of data element “1” inside HDD 16-0-1, 16-0-2 and 16-0-3), and a second P parity corresponding to a second data unit (“P1-1”, which is the exclusive OR of data element “1” inside HDD 16-1-1, 16-1-2 and 16-1-3).
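
A short sketch of this computation, using hypothetical data element contents, is given below; each P parity is the XOR of the data elements “1” of its partial RAID group, and the compressed parity “CP1” is the XOR of the two P parities.

```python
def xor_blocks(blocks):
    out = bytearray(len(blocks[0]))
    for block in blocks:
        for i, b in enumerate(block):
            out[i] ^= b
    return bytes(out)

# Hypothetical data elements "1" (illustrative values).
group0_elements = [b"\x01" * 8, b"\x02" * 8, b"\x03" * 8]  # HDD 16-0-1, 16-0-2, 16-0-3
group1_elements = [b"\x04" * 8, b"\x05" * 8, b"\x06" * 8]  # HDD 16-1-1, 16-1-2, 16-1-3

p1_0 = xor_blocks(group0_elements)      # P parity "P1-0" of partial RAID group 0
p1_1 = xor_blocks(group1_elements)      # P parity "P1-1" of partial RAID group 1
cp1 = xor_blocks([p1_0, p1_1])          # compressed parity "CP1", written to partial RAID group 2
```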

In the fourth data protection mode, a plurality of specified stripes exists in one HDD 16. Thus, a plurality of rows of specified stripes exists in a partial RAID group. The plurality of rows of specified stripes can be distributed in the partial RAID group, or, as shown in the figure, the plurality of rows of specified stripes can exist as consecutive rows of specified stripes at the tail end of the partial RAID group.

The following rule is observed in the fourth data protection mode of this embodiment. That is, one compressed parity, which has been created from a plurality of P parities corresponding to data units inside a plurality of partial RAID groups, is written to a partial RAID group other than the partial RAID groups used to create the relevant compressed parity, from among the plurality of partial RAID groups configuring the total RAID group. This is to avoid a situation in which the compressed parity and either the data elements in the respective data units corresponding thereto or the Q parities corresponding to the respective data units become unreadable simultaneously as the result of an HDD 16 failure. For example, compressed parity “CP1” will be explained. Because the six data elements configuring the two data units corresponding to the compressed parity “CP1” are written to HDD 16-0-1, 16-0-2, 16-0-3, 16-1-1, 16-1-2 and 16-1-3, and the Q parities corresponding to these two data units are written in HDD 16-0-0 and 16-1-0, the compressed parity “CP1” is written to the partial RAID group (in this example, partial RAID group 2) other than partial RAID groups 0 and 1, which are the partial RAID groups configured from these HDD 16. In other words, the plurality of data units corresponding to the plurality of P parities, which is to be the basis for the compressed parity written to partial RAID group 2, should be the data units that are written in partial RAID group 0 and partial RAID group 1.

Next, a specific example of data configured in the Disk Group configuration table 700 in the fourth data protection mode will be explained.

In FIG. 7, Disks 43, 44, 45 and 46 of Disk Group 0 of VDEV 4 correspond to HDD 16-0-0, 16-0-1, 16-0-2 and 16-0-3 of FIG. 16. Since these Disks configure a partial RAID group, “PARTIAL”, which represents the partial RAID group, is configured in RAID level column 704 corresponding to Disk Group 0 of VDEV 4 in FIG. 7.

Next, the updating of a data unit, Q parity and compressed parity in accordance with the fourth data protection mode will be explained. Furthermore, in the following explanation, a pre-update data element will be referred to as an “old data element”, and a post-update data element will be referred to as a “new data element” (the same will hold true for a Q parity and a compressed parity).

For example, take a case in which the stripe 1 inside HDD 16-0-1 is to be updated. For the sake of explanation, the data inside the stripe 1 of HDD 16-0-1 will be called data element “1”, the data inside the stripe Q1 of HDD 16-0-0 will be called Q parity “Q1”, and the data inside the stripe CP1 of HDD 16-2-0 will be called compressed parity “CP1”. When updating data element “1”, it is necessary to update Q parity “Q1” and compressed parity “CP1”. This is because the values of the Q parity “Q1” and the compressed parity “CP1” are each dependent on data element “1”, and if the value of the data element “1” is updated, the values of the Q parity “Q1” and the compressed parity “CP1” will change.

Accordingly, the disk I/O processor 202 first reads from HDD 16-0-1 and 16-0-0 the old data element “1” and the old Q parity “Q1”, which are required to create a new Q parity “Q1”. Then the disk I/O processor 202 can create the new Q parity “Q1” on the basis of the new data element “1”, the old data element “1” and the old Q parity “Q1” (or, can create a new Q parity “Q1” from the data elements inside the stripes 1 of HDD 16-0-2 and 16-0-3, which are the data elements that have not been updated, and the new data element “1”). Furthermore, the creation of the Q parity conforms to the RAID 6 Q parity create method. Next, the disk I/O processor 202 also carries out the creation of a new compressed parity “CP1” using the same method as that for creating the new Q parity “Q1”. That is, the disk I/O processor 202 can create the new compressed parity “CP1” on the basis of the old compressed parity “CP1”, the old data element “1” and the new data element “1” (or, the disk I/O processor 202 can read out data from the stripes 1 inside HDD 16-0-2, 16-0-3, 16-1-1, 16-1-2 and 16-1-3, and create a new compressed parity “CP1” on the basis of the read-out data and the new data element “1”). However, the creation of the compressed parity conforms to the RAID 6 P parity create method. Thereafter, the disk I/O processor 202 writes the new data element “1” to the stripe in which the old data element “1” is being stored, writes the new Q parity “Q1” to the stripe in which the old Q parity “Q1” is being stored, and writes the new compressed parity “CP1” to the stripe in which the old compressed parity “CP1” is being stored.

Next, the mapping of a logical address and a physical address in the fourth data protection mode will be explained. In the fourth data protection mode, for example, it is possible to map a logical address and a physical address such that consecutive logical addresses span a plurality of partial RAID groups. In other words, for example, two or more logical addresses, which respectively correspond to two or more stripes configuring a pth row of stripes inside an xth partial RAID group (for example, partial RAID group 0), and two or more logical addresses, which respectively correspond to two or more stripes configuring a pth row of stripes inside an x+1st partial RAID group (for example, partial RAID group 1), are consecutive (where x and p are integers greater than 0).

In the example of FIG. 16, for example, mapping of the logical addresses and physical addresses is carried out such that stripes 0 of HDD 16-0-1, 16-0-2, 16-0-3, 16-1-1, 16-1-2 and 16-1-3 constitute consecutive logical addresses. When writing lengthy data having consecutive logical addresses, carrying out mapping like this increases the probability of being able to create a compressed parity by determining the P parity of data elements inside this lengthy data. In the example of FIG. 16 above, it is possible to create a compressed parity CP1 inside HDD 16-2-0 using the data elements of stripes 1 inside HDD 16-0-1, 16-0-2, 16-0-3, 16-1-1, 16-1-2 and 16-1-3. Creating a compressed parity like this does away with the need to read out the old compressed parity from the HDD 16 when writing lengthy data, thereby making it possible to enhance write efficiency.

FIG. 17 is a schematic diagram of a restore in the fourth data protection mode. In the fourth data protection mode as well, it is possible to restore one data element that cannot be read out from single-failure data, and two data elements that cannot be read out from double-failure data.

In this figure, HDD 16-0-1 is a completely failed HDD, and HDD 16-0-2 is a partially failed HDD. In this partially failed HDD 16-0-2, stripes 1, 2 and Q4 are failed stripes.

The procedure for restoring a data element configuring single-failure data can utilize the same concept as that of a normal RAID 6. That is, the data element configuring the single-failure, which is stored in the failed stripe of completely failed HDD 16-0-1, is restored using all the other data elements configuring this single-failure data, and the Q parity corresponding to this single-failure data (in other words, using the data elements and Q parities of all the other stripes in the row of stripes comprising this failed stripe). However, this procedure is only employed when it is possible to read out the Q parity corresponding to the failed stripe.

Conversely, two unreadable data elements configuring double-failure data cannot be restored with the same procedure as that of a normal RAID 6 (for example, when the two unreadable data elements are stripes 1 of HDD 16-0-1 and 16-0-2 of the same figure). Further, when a Q parity corresponding to the failed stripe of single-failure data cannot be read out, the one data element configuring the single-failure data cannot be restored via the same procedure as that of a normal RAID 6 (for example, when the unreadable data element is stripe 2 of HDD 16-0-2 of the same figure, and the unreadable Q parity is stripe Q2 of HDD 16-0-1 of this figure).

Accordingly, the restore of the two unreadable data elements configuring the double-failure data, or the restore of one unreadable data element configuring the single-failure data and the unreadable Q parity, is carried out as follows utilizing a compressed parity.

First, a P parity corresponding to either the double-failure data or the single-failure data is created ((a) in this figure). More specifically, for example, P parity “P1-0”, which forms the basis for compressed parity “CP1”, and which corresponds to partial RAID group 0 to which the failed stripes belong, is created by computing an exclusive OR on the basis of the compressed parity “CP1”, and the P parity “P1-1”, which is the basis of the compressed parity “CP1”, and which corresponds to a partial RAID group other than partial RAID group 0 to which the failed stripes belong.

Thereafter, in accordance with the same procedure as RAID 6, the two unreadable data elements configuring the double-failure data, or one unreadable data element configuring the single-failure data and the unreadable Q parity, are restored ((b) in this figure). More specifically, for example, either the data elements or the Q parities of the two failed stripes (in this example, stripes 1 of HDD 16-0-1 and 16-0-2) are restored on the basis of the created P parity “P1-0”, and either the data element or Q parity (in this example, stripe Q1 of HDD 16-0-0 and stripe 1 of HDD 16-0-3) of (readable) stripes other than the two failed stripes of the row of stripes of partial RAID group 0 to which the failed stripes belong.

Finally, the respective compressed parities stored in the respective specified stripes of the completely failed HDD 16-0-1 are restored ((c) in this figure). More specifically, since all of the single-failure data and double-failure data have been restored in accordance with the restores of the single-failure data and the double-failure data explained thus far, a P parity is created on the basis of the restored data unit, and compressed parities “CP6”, “CP18” and so forth, which are stored in the completely failed HDD 16-0-1, are restored on the basis of the created P parity and the P parity, which corresponds to the other data unit corresponding to the compressed parity to which this data unit corresponds.
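
Step (a) above reduces to a single XOR, sketched below with values carried over from the illustrative write-side sketch; step (b) then proceeds as an ordinary RAID 6 restore using the recovered P parity and the Q parity, and is not reproduced here.

```python
def xor(a, b):
    return bytes(x ^ y for x, y in zip(a, b))

# Values consistent with the illustrative write-side sketch (assumptions):
cp1 = b"\x07" * 8    # compressed parity "CP1" read from partial RAID group 2
p1_1 = b"\x07" * 8   # P parity "P1-1", recomputed from the readable data elements "1" of partial RAID group 1

# Since CP1 = P1-0 XOR P1-1, XORing CP1 with P1-1 cancels P1-1 and leaves P1-0.
p1_0 = xor(cp1, p1_1)   # equals b"\x00" * 8 with the values assumed here
```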

The preceding is an explanation for the fourth data protection mode.

Furthermore, this explanation used a RAID configuration based on a RAID 6 known as a P+Q mode, but a RAID configuration based on a RAID 6 known as a two-dimensional XOR mode can be used instead of this. When the two-dimensional XOR mode is used, a compressed parity is created by creating two-dimensional XOR mode diagonal parities for a plurality of partial RAID groups and computing the exclusive ORs of these diagonal parities, and this compressed parity is stored in a different partial RAID group. Further, in the above explanation, it is supposed that HDD 16-0-2 is the partially failed HDD, but the same restore would be possible even if HDD 16-0-2 were the completely failed HDD. Further, in the above explanation, it is supposed that HDD 16-0-1 is the completely failed HDD, but the same restore would be possible even if HDD 16-0-1 were the partially failed HDD.

The preceding are explanations of the first through fourth data protection modes. Furthermore, in all of the data protection modes, unreadable data elements in single-failure data and double-failure data for completely failed HDD and partially failed HDD are restored to a prescribed storage resource, for example, a spare HDD.

The flows of the various processes carried out in this embodiment will be explained hereinbelow.

FIG. 18 is a flowchart of processing carried out by the command processor 201 when the storage system 1 receives an I/O request from the host 4.

When the host 4 accesses a LU, an I/O request specifying the WWN and LUN assigned to the LU, as well as a read- or write-targeted address (LBA: Logical Block Address), is issued to the storage system 1. The command processor 201 responds to the receipt of this I/O request, references the LU configuration table 600, and calculates an LDEV identification number (LDEV number) corresponding to the LUN and WWN (S101). Next, the command processor 201 determines whether or not the I/O request from the host 4 is a write request (S102). When the I/O request is a write request (S102: YES), this processing proceeds to S103, and when the I/O request is not a write request (is a read request) (S102: NO), this processing moves to S105.

In S103, the command processor 201 stores the write-data (write-targeted data conforming to the I/O request) in an unused area of the cache area of the cache/control memory 14, and in S104, notifies the host 4 that the relevant write process has been completed. The processing of S104 can be carried out subsequent to this, for example, after S105. At the point in time of S104, the data write to the HDD 16 has not been completed, but notifying the host 4 that processing has been completed when the write-data is stored in the cache area makes it possible to speed up write process response time.

In S105, the command processor 201 references the RAID configuration table 400 and the VDEV configuration table 500, and determines the RAID level of the VDEV comprising the LDEV to which the calculated LDEV number was assigned in S101.

When the value of the RAID level is “0x0a” (S105: 0x0a), the command processor 201 implements a read or write process for the LDEV to which the calculated LDEV number was assigned in S101 based on the first data protection mode (S106). This process will be described in detail using FIG. 19. Next, when the value of the RAID level is “0x0b” (S105: 0x0b), the command processor 201 implements a read or write process for the LDEV to which the calculated LDEV number was assigned in S101 based on the second data protection mode (S107). This process will be described in detail using FIGS. 22 and 23.

Next, when the value of the RAID level is “0x0c” (S105: 0x0c), the command processor 201 implements a read or write process for the LDEV to which the calculated LDEV number was assigned in S101 based on the third data protection mode (S108). This process will be described in detail using FIGS. 22 and 23.

Finally, when the value of the RAID level is “0x0d” (S105: 0x0d), the command processor 201 implements a read or write process for the LDEV to which the calculated LDEV number was assigned in S101 based on the fourth data protection mode (S109). This process will be described in detail using FIGS. 27 and 28.

In S110, the command processor 201 determines if the received I/O request is a read request. When the relevant request is a read request (S110: YES), the read-targeted data from the HDD 16 is stored in the cache area via the processing of the above-mentioned S106, S107, S108 and S109, and the command processor 201 returns the read-targeted data that is in the cache area to the host 4 (S111). In S110, when it is determined that the relevant request is not a read request (S110: NO), this processing ends.

FIG. 19 is a flowchart of I/O processing based on the first data protection mode.

In the explanation of FIG. 19, the LDEV targeted by the I/O request will be referred to as the “target LDEV”, the respective HDD of the VDEV comprising the target LDEV will be referred to as the “target HDD”, and the HDD address corresponding to the volume area specified by the LBA specified in the I/O request from the host 4 will be called the “target physical address”.

In S201, the LBA specified by the I/O request from the host 4 is converted to the target physical address. More specifically, for example, the command processor 201 sends to the DKA 13 an I/O request comprising the LBA specified in the I/O request from the host 4, and the disk I/O processor 202 inside the DKA 13 receives this I/O request. This I/O request can be written to the control area of the cache/control memory 14, and it can be sent to the DKA 13 via the internal switch 15. The DKA 13, which receives the I/O request, is the DKA 13 connected to the respective target HDD 16. The disk I/O processor 202 of the relevant DKA 13 converts the LBA inside the received I/O request to the target physical address.

In S202, the disk I/O processor 202 determines if the received I/O request is a write request. When the received I/O request is a write request (S202: YES), this processing proceeds to S203, and when the received I/O request is a read request (S202: NO), this processing moves to S208. Furthermore, the processing of S202 may be completed prior to the end of S201 processing.

In S203, the disk I/O processor 202 locks the stripes related to the write for the write-targeted data placed in the cache area (for example, the stripes in which the respective data units, into which the write-targeted data has been divided, the P parities related to the respective data units, or an updated compressed parity are written) so as to make these stripes inaccessible to other requests. In the storage system 1, since a plurality of access requests from the host 4 are handled simultaneously, there is the possibility of a plurality of updates being simultaneously generated to the stripes related to the write of the write-targeted data. If the updating of a stripe required to create P parity is generated by this other access process in the midst of a P parity create process (between S203 and S207), the contents of the P parity will become inconsistent. To prevent this from happening, the lock process is carried out.

In S204, the disk I/O processor 202 creates a new P parity from the new data element in the data unit, the old data element corresponding to the new data element, and the old P parity, and writes this created new P parity to the cache area. Furthermore, when the old data element and the old P parity are not stored in the cache area, the disk I/O processor 202 reads out the old data element and the old P parity from the HDD 16 prior to carrying out this processing.

In S205, the disk I/O processor 202 creates a new compressed parity from the new data element, old data element and old compressed parity, and writes this created new compressed parity to the cache area. Furthermore, when the old compressed parity is not stored in the cache area, the disk I/O processor 202 reads out the old compressed parity from the HDD 16 prior to carrying out this processing.

In S206, the disk I/O processor 202 writes the new data element and new P parity to the respective target HDD 16 by sending to the respective target HDD 16 new data element and new P parity write requests, which specify the target physical addresses.

In S207, the disk I/O processor 202 writes the new compressed parities to the respective target HDD 16 by sending to the respective target HDD 16 new compressed parity write requests, which specify the target physical addresses, and unlocks the stripes, which were locked in S203. Furthermore, the disk I/O processor 202, for example, can write all the post-update compressed parities written to the cache area to the HDD 16 simultaneously at a prescribed timing without carrying out the processing of this S207.

In S208, the disk I/O processor 202 reads out the read-targeted data from the respective HDD 16, and stores this read-out read-targeted data in the cache area.

FIG. 20 is a flowchart of a restore process in the first data protection mode.

The disk I/O processor 202 carries out the restore process. In the explanation of FIG. 20, it is supposed that there is one completely failed HDD in the RAID group. In the processing of FIG. 20, the disk I/O processor 202 restores all the data element, P parity and compressed parity stripes in the completely failed HDD. Further, if either a data element or P parity read from the stripe of one other HDD fails at the time of this restore (that is, when there is at least one failed stripe besides the completely failed HDD), the disk I/O processor 202 moves to the processing of FIG. 21, and restores either the data element or P parity comprised in the HDD 16 other than the completely failed HDD 16.

The disk I/O processor 202 records in the cache/control memory 14 the count value (described as “count value A” hereinafter), which represents the rank-order number of a row of stripes from the top of the HDD, and the compressed parity (described as the “provisional compressed parity RD” hereinafter), which was arbitrarily updated in the midst of this restore process.

In S301, the disk I/O processor 202 respectively sets the initialization values of count value A and the provisional compressed parity RD to 0.

In S302, the disk I/O processor 202 reads out the data elements and P parity (there may be times when there is no P parity) from all the stripes other than the failed stripe (completely failed stripe hereinafter) of the completely failed HDD in the row of stripes specified from the count value A (the Ath stripe from the top). That is, the disk I/O processor 202 reads out the Ath stripe from the top of each HDD 16 other than the completely failed HDD.

In S303, the disk I/O processor 202 determines whether or not the read in S302 was successful. When this read was successful (S303: YES), this processing moves to S306, and when this read failed (S303: NO), this processing proceeds to S304.

In S304, the disk I/O processor 202 restores the read-failed data element and/or P parity of the processing of S302. This process will be explained in detail below by referring to FIG. 21.

In S305, the disk I/O processor 202 determines whether or not the restore of S304 succeeded. When the restore was successful (S305: YES), this processing proceeds to S306, and when the restore failed (S305: NO), this processing ends.

In S306, the disk I/O processor 202 creates either the data element or P parity of the completely failed stripe corresponding to count value A from the data element and P parity read out in S303 or restored in S304 by computing the exclusive OR thereof, and writes this created data element or P parity (described as the “restoration element ND” hereinafter) to the cache area.

In S307, the disk I/O processor 202 computes the exclusive OR of the provisional compressed parity RD and the restoration element ND stored in the cache area, and makes the computed value the provisional compressed parity RD. That is, the provisional compressed parity RD is updated to the most recent provisional compressed parity RD based on this restored restoration element ND.

In S308, the disk I/O processor 202 writes the restoration element ND to the stripe, which is in the same location in the spare HDD as the location of the target stripe in the completely failed HDD 16. Furthermore, the spare HDD is mounted in the storage system 1, and is the HDD that commences operation in place of the completely failed HDD 16 subsequent to this restore process ending normally, in other words, the HDD that becomes a member of the RAID group in place of the completely failed HDD.

In S309, the disk I/O processor 202 adds 1 to the count value A.

In S310, the disk I/O processor 202 determines whether or not the post-update count value A in accordance with S309 is the same as the number of rows of normal stripes. When count value A constitutes the number of rows of normal stripes (S310: YES), this processing proceeds to S311, and when the count value A is less than the number of rows of normal stripes (S310: NO), this processing moves to S302. The fact that count value A is the number of rows of normal stripes signifies that all the normal stripes in the completely failed HDD have been restored. Moving to S311 means the provisional compressed parity RD is the compressed parity that was completed based on the data elements and P parities corresponding to all the normal stripes in the completely failed HDD.

In S311, the disk I/O processor 202 writes the provisional compressed parity RD (that is, the completed compressed parity) to the stripe, which is in the same location in the spare HDD 16 as the location of the specified stripe in the completely failed HDD 16.

According to the above-described series of processes, all the data elements, P parities and compressed parities stored in the completely failed HDD are restored to a spare HDD.
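
The loop of FIG. 20 can be summarized in the compact sketch below (an illustration only, with the restore-on-read-failure branch of S304 omitted); HDD are modeled as lists of equal-length byte strings, one entry per stripe.

```python
def xor_blocks(blocks):
    out = bytearray(len(blocks[0]))
    for block in blocks:
        for i, b in enumerate(block):
            out[i] ^= b
    return bytes(out)

def rebuild_completely_failed_hdd(surviving_hdds, spare_hdd, normal_rows, stripe_size):
    rd = bytes(stripe_size)                          # provisional compressed parity RD (S301)
    for a in range(normal_rows):                     # count value A
        row = [hdd[a] for hdd in surviving_hdds]     # reads of S302
        nd = xor_blocks(row)                         # restoration element ND (S306)
        rd = xor_blocks([rd, nd])                    # fold ND into RD (S307)
        spare_hdd[a] = nd                            # write ND to the spare HDD (S308)
    spare_hdd[normal_rows] = rd                      # write the completed compressed parity (S311)
```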

FIG. 21 is a flowchart of processing equivalent to S304 of FIG. 20.

This processing is implemented in S304 of FIG. 20. In this processing, the disk I/O processor 202 restores either a data element or P parity, which failed to be read out in S303 of FIG. 20. This processing is carried out for an HDD 16 having a failed stripe, which is storing either the data element or P parity that failed to be read out in S303 of FIG. 20, that is, for a partially failed HDD. In FIG. 21, since either the data element or P parity stored in this failed stripe is the restore target, hereinafter, this failed stripe will be referred to as the “partially failed stripe”.

First, the disk I/O processor 202 writes to the cache area a count value (hereinafter, described as “count value B”), which represents the rank-order number of a row of stripes from the top of the HDD, and a midway temporary value in the computing of either the data element or P parity stored in the partially failed stripe (hereinafter, described as the “provisional restoration element NDB”).

In S401, the disk I/O processor 202 respectively sets 0 as the initialization values of count value B and the provisional restoration element NDB.

In S402, the disk I/O processor 202 determines whether or not count value B is the same as count value A. When count value B and count value A are the same (S402: YES), this processing moves to S406, and when count value B and count value A are different (S402: NO), this processing proceeds to S403.

In S403, the disk I/O processor 202 reads out either the data element or the P parity from the stripe specified from count value B.

In S404, the disk I/O processor 202 determines whether or not the read of S403 was successful. When this read was successful (S404: YES), this processing proceeds to S405, and when this read failed (S404: NO), this processing ends in an error. Thus, S305 of FIG. 20 becomes NO, and the restore process ends in an error. That is, when two failed stripes exist in a partially failed HDD, a restore process in the first data protection mode ends in an error.

In S405, the disk I/O processor 202 computes the exclusive OR of either the data element or P parity read in S403 and the provisional restoration element NDB already being stored in the cache area, and makes this computed value the provisional restoration element NDB. That is, the provisional restoration element NDB is updated to the most recent value.

In S406, the disk I/O processor 202 adds 1 to count value B.

In S407, the disk I/O processor 202 determines whether or not count value B and the number of rows of normal stripes are the same. When count value B is the same as the number of rows of normal stripes (S407: YES), this processing proceeds to S408, and when count value B differs from the number of rows of normal stripes (S407: NO), this processing moves to S402. When count value B is the same as the number of rows of normal stripes, the provisional restoration element NDB, which was created based on the data elements and P parity stored in all the normal stripes besides the one failed stripe in the partially failed HDD, constitutes either the data element or P parity stored in this failed stripe. In other words, either the data element or P parity stored in the failed stripe has been restored in the cache area.

In S408, the disk I/O processor 202 writes the provisional restoration element NDB, that is, either the data element or P parity stored in the failed stripe, which has been restored in the cache area, to a replacement sector inside the partially failed HDD 16. The replacement sector is a reserved stripe provided in the HDD. Thereafter, this reserved stripe can be used as the Ath stripe in the relevant HDD in place of the failed stripe.

FIG. 22 is a flowchart of I/O processing based on the second and third data protection modes.

Similar to the explanation of FIG. 19, in the explanations of FIGS. 22 and 23, an LDEV that is the target of an I/O request will be referred to as the “target LDEV”, the respective HDD of the VDEV comprising the target LDEV will be referred to as the “target HDD”, and the HDD address corresponding to the volume area specified by the LBA, which is specified by the I/O request from the host 4, will be called the “target physical address”.

The processing of S2201 is the same processing as that of S201 of FIG. 19. That is, in S2201, the LBA specified by the I/O request from the host 4 is converted to the target physical address.

In S2202, the same as the processing of S202 of FIG. 19, the disk I/O processor 202 determines if the received I/O request is a write request or a read request. When the received I/O request is a write request (S2202: YES), this processing proceeds to S2203, and when the received I/O request is a read request (S2202: NO), this processing moves to S2209.

In S2203, the disk I/O processor 202 locks the stripes related to the write of the write-targeted data placed in the cache area (for example, the write-destination stripes, such as those of the data units into which the write-targeted data has been divided, and the horizontal and vertical parities corresponding thereto) so as to make these stripes inaccessible to other requests.

In S2204, the disk I/O processor 202 determines whether or not the size of the write-targeted data is greater than the size of the data unit. When the size of the write-targeted data is smaller than the size of the data unit (S2204: NO), this processing moves to S2205, and when the size of the write-targeted data is larger than the size of the data unit (S2204: YES), this processing moves to S2208.

In S2205, the disk I/O processor 202 creates a new horizontal parity from the new data element in the data unit, the old data element corresponding to this new data element, and the old horizontal parity by computing the exclusive OR thereof, and writes this created new horizontal parity to the cache area. Furthermore, when the old data element and the old horizontal parity are not stored in the cache area, the disk I/O processor 202 reads out the old data element and the old horizontal parity from the HDD 16 prior to carrying out this processing.

In S2206, the disk I/O processor 202 creates a new vertical parity from the new data element, old data element and old vertical parity, and writes this created new vertical parity to the cache area. Furthermore, when the old vertical parity is not stored in the cache area, the disk I/O processor 202 reads out the old vertical parity from the HDD 16 prior to carrying out this processing.
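
Both S2205 and S2206 are instances of the read-modify-write parity update: the new parity is the exclusive OR of the old parity, the old data element, and the new data element (the exclusive OR is stated for S2205 and assumed here for S2206 as well). A minimal sketch, with illustrative helper names:

def xor_bytes(*blocks):
    # Bytewise exclusive OR of equal-length blocks.
    result = bytearray(len(blocks[0]))
    for block in blocks:
        for i, b in enumerate(block):
            result[i] ^= b
    return bytes(result)

def updated_parity(old_parity, old_data_element, new_data_element):
    # S2205 (horizontal parity) and S2206 (vertical parity) both reduce to:
    # new parity = old parity XOR old data element XOR new data element.
    return xor_bytes(old_parity, old_data_element, new_data_element)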

In S2207, the disk I/O processor 202 writes the new data element, new horizontal parity and new vertical parity to the respective target HDD 16 by sending to the respective target HDD 16 the new data element, new horizontal parity, and new vertical parity write requests, which specify the target physical addresses. Furthermore, the disk I/O processor 202, for example, can simultaneously write all of the post-update new data elements, new horizontal parities, and new vertical parities that have been written to the cache area to the HDD 16 at a prescribed timing without carrying out this S2207 processing.

In S2208, the disk I/O processor 202 carries out a write process when the write-targeted data is larger than the data unit. This processing will be described in detail below by referring to FIG. 23.

In S2209, the same as the processing of S208 of FIG. 19, the disk I/O processor 202 reads out the read-targeted data from the respective HDD 16, and stores this read-out read-targeted data in the cache area.

FIG. 23 is a flowchart of processing that is equivalent to that of S2208 of FIG. 22.

This processing is implemented in S2208 of FIG. 22.

In S2301, the disk I/O processor 202 uses the respective data units corresponding to the write-targeted data placed in the cache area to create a new horizontal parity, and writes this created new horizontal parity to the cache area. Furthermore, according to the configuration of a data unit (for example, when only one data element is updated), a new horizontal parity can be created via the same processing as that of S2205 of FIG. 22.

In S2302, the disk I/O processor 202 determines whether or not the number of data units stored in the cache area is equal to or greater than a prescribed number (the number of data units corresponding to one vertical parity). When a prescribed number or more of data units have been placed in the cache area (S2302: YES), this processing proceeds to S2303, and when a prescribed number or more of data units have not been placed in the cache area (S2302: NO), this processing moves to S2305.

In S2303, the disk I/O processor 202 creates a new vertical parity on the basis of the data elements comprised in a plurality of data units stored in the cache area, and writes this new vertical parity to the cache area. Furthermore, when the number of data units cannot be divided evenly by the number of data units corresponding to one vertical parity (when a remainder is generated), the disk I/O processor 202 creates a new vertical parity for the surplus data units via the processing described for S2305 and S2306. Or, a new vertical parity can also be created using a data unit to be written anew in the future, or an existing data unit.
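
S2303 amounts to a full-stripe parity computation: assuming the vertical parity is the exclusive OR of the corresponding data elements taken from the prescribed number of data units held in the cache area, it can be sketched as follows (the function name and the use of equal-length byte strings are illustrative):

from functools import reduce

def create_vertical_parity(column_data_elements):
    # S2303 sketch: XOR together the corresponding data elements taken from the
    # prescribed number of data units held in the cache area.
    return bytes(reduce(lambda acc, element: [a ^ b for a, b in zip(acc, element)],
                        column_data_elements,
                        [0] * len(column_data_elements[0])))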

In S2305, the disk I/O processor 202 reads out the old data element and the old vertical parity from the HDD 16.

In S2306, the same as the processing of S2206 of FIG. 22, the disk I/O processor 202 creates a new vertical parity from the new data element, old data element and old vertical parity by computing the exclusive OR thereof, and writes this created new vertical parity to the cache area.

In S2307, the same as the processing of S2207 of FIG. 22, the disk I/O processor 202 writes the new data element, new horizontal parity and new vertical parity to the respective target HDD 16 by sending to the respective target HDD 16 the new data element, new horizontal parity, and new vertical parity write requests, which specify the target physical addresses.

FIG. 24 is a flowchart of restore processing in the second and third data protection modes.

In the explanation of FIG. 24, it is supposed that there is one completely failed HDD in the RAID group, the same as in the case of FIG. 20. In the processing of FIG. 24, the disk I/O processor 202 restores the data elements and horizontal parities of all the stripes in the completely failed HDD.

The disk I/O processor 202 records in the cache/control memory 14 a count value (described as “count value A” hereinafter), which represents the rank-order number of a row of stripes from the top of the HDD.

In S2401, the disk I/O processor 202 sets the count value A to 0.

In S2402, the disk I/O processor 202 reads out the data elements and horizontal parity (there may not be any horizontal parity) from all the stripes other than the completely failed stripe in the row of stripes specified from the count value A.

In S2403, the disk I/O processor 202 determines whether or not the read of S2402 was successful (if everything was read out), and when this read was successful (S2403: YES), this processing moves to S2404, and when this read failed (S2403: NO), this processing moves to S2407. When the read fails, this signifies that there were two or more failed stripes in the target row of stripes.

In S2404, the disk I/O processor 202 uses the data elements and horizontal parity read out in S2402 to create either a data element or a horizontal parity for the completely failed stripe, and writes this data element or horizontal parity to a stripe in the spare HDD 16, which is in the same location as the location of the completely failed stripe corresponding to count value A in the completely failed HDD 16.
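
Because the horizontal parity is created as an exclusive OR over the data elements of its row, S2404 can be realized by XORing together everything that was read in S2402; the result is the element of the completely failed stripe. A minimal sketch under that assumption:

def restore_from_row(surviving_stripes):
    # S2404 sketch: the element of the completely failed stripe is the XOR of the
    # data elements and horizontal parity read from the surviving stripes of the row.
    restored = bytearray(len(surviving_stripes[0]))
    for stripe in surviving_stripes:
        for i, b in enumerate(stripe):
            restored[i] ^= b
    return bytes(restored)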

In S2405, the disk I/O processor 202 adds 1 to the count value A.

In S2406, the disk I/O processor 202 determines whether or not the count value A is identical to the number of rows of stripes inside the HDD. When the count value A constitutes the number of rows of stripes (S2406: YES), this processing ends, and when the count value A does not constitute the number of rows of stripes (S2406: NO), this processing moves to S2402.

In S2407, the disk I/O processor 202 selects one of the failed stripes in the target row of stripes that has not yet been restored.

In S2408, the disk I/O processor 202 restores either the data element or the horizontal parity that corresponds to the selected failed stripe. This processing will be explained hereinbelow by referring to FIG. 25. Furthermore, if there is only one failed stripe in the target row of stripes, either the data element or the horizontal parity that corresponds to the relevant failed stripe can be restored via the processing of S2404 instead of this processing.

In S2409, the disk I/O processor 202 determines if all the failed stripes of the target row of stripes have been restored. If all the failed stripes have been restored (S2409: YES), this processing moves to S2405. If all the failed stripes have not been restored (S2409: NO), this processing moves to S2407.

FIG. 25 is a flowchart of processing that is equivalent to that of S2408 of FIG. 24.

This processing is implemented in S2408 of FIG. 24.

In S2501, the disk I/O processor 202 reads out from the corresponding stripes all of the data elements and vertical parities required to restore the data element and horizontal parity corresponding to the target stripe (the failed stripe or the completely failed stripe that corresponds to either the data element or horizontal parity that failed to be read out in the read implemented in S2402).

In S2502, the disk I/O processor 202 determines whether or not the read of S2501 was successful. When this read succeeded (S2502: YES), this processing proceeds to S2503, and when this read failed (S2502: NO), this processing ends in an error.

In S2503, the disk I/O processor 202 uses the data element and vertical parity read out in S2501 to create either a data element or a horizontal parity. Then, if either the created data element or horizontal parity is either a data element or horizontal parity inside the completely failed HDD, the disk I/O processor 202 writes either the created data element or horizontal parity to the stripe in the spare HDD 16, which is in the same location as the location of the completely failed stripe corresponding to count value A. If either the created data element or horizontal parity is not either a data element or horizontal parity of the completely failed HDD, either the created data element or horizontal parity is written to a replacement sector of the HDD 16 to which the relevant data element or horizontal parity belongs.
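
Assuming the vertical parity is, like the horizontal parity, an exclusive OR over the data elements it covers, S2503 reduces to another XOR accumulation, this time down the column of stripes covered by the vertical parity rather than across a row of stripes. A minimal sketch under that assumption (helper names are illustrative):

def restore_from_column(vertical_parity, other_column_elements):
    # S2503 sketch: rebuild the element of the target stripe from the vertical
    # parity and the other data elements covered by that vertical parity.
    restored = bytearray(vertical_parity)
    for element in other_column_elements:
        for i, b in enumerate(element):
            restored[i] ^= b
    return bytes(restored)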

FIG. 26 is a flowchart of processing in the second data protection mode for restoring an HDD in which a vertical parity is stored.

In the explanation of FIG. 26, it is supposed that, of the HDD 16 configuring a vertical RAID group, the HDD 16 storing the vertical parity is a completely failed HDD. In the processing of FIG. 26, the disk I/O processor 202 restores the vertical parity of all the stripes in the completely failed HDD.

The disk I/O processor 202 records in the cache/control memory 14 a count value (described as “count value A” hereinafter), which represents the rank-order number of a row of stripes from the top of the HDD.

In S2601, the disk I/O processor 202 sets the count value A to 0.

In S2602, the disk I/O processor 202 reads out all the data elements, which are inside the vertical RAID group to which the vertical parity belongs, and which are needed to restore the relevant vertical parity of the row of stripes corresponding to the count value A.

In S2603, the disk I/O processor 202 determines whether or not the read of S2602 was successful. When this read succeeded (S2603: YES), this processing proceeds to S2604, and when this read failed (S2603: NO), this processing ends in error.

In S2604, the disk I/O processor 202 creates a vertical parity from all the data elements read out in S2602, and writes the created vertical parity to the stripe corresponding to the count value A in the spare HDD 16.
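
Taken together, S2601 through S2606 form a per-row rebuild loop over the completely failed vertical-parity HDD. The following is a minimal sketch of that control flow, assuming the vertical parity is the exclusive OR of the data elements read in S2602; the helper names are illustrative, and the error handling of S2603 is reduced to whatever exception read_column_elements raises:

def rebuild_vertical_parity_hdd(num_stripe_rows, read_column_elements, write_spare):
    # Sketch of the FIG. 26 loop: recreate every vertical parity of the failed HDD.
    count_a = 0                                    # S2601
    while count_a != num_stripe_rows:              # S2606
        elements = read_column_elements(count_a)   # S2602 (a read error surfaces here, S2603)
        parity = bytearray(len(elements[0]))
        for element in elements:                   # S2604: XOR all data elements together
            for i, b in enumerate(element):
                parity[i] ^= b
        write_spare(count_a, bytes(parity))        # S2604: write to the spare HDD 16
        count_a += 1                               # S2605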

In S2605, the disk I/O processor 202 adds 1 to the count value A.

In S2606, the disk I/O processor 202 determines whether or not the count value A is identical to the number of rows of stripes inside the HDD. When the count value A constitutes the number of rows of stripes (S2606: YES), this processing ends, and when the count value A does not constitute the number of rows of stripes (S2606: NO), this processing moves to S2602.

FIG. 27 is a flowchart of I/O processing based on the fourth data protection mode.

The same as the explanation of FIG. 19, in the explanations of FIGS. 27 and 28, the I/O-request-targeted LDEV will be referred to as the “target LDEV”, the respective HDD of the VDEV comprising the target LDEV will be referred to as the “target HDD”, and the HDD address, which is corresponded to the volume area specified by the LBA specified by the I/O request from the host 4, will be called the “target physical address”.

The processing of S2701 is the same processing as that of S201 of FIG. 19. That is, in S2701, the LBA specified by the I/O request from the host 4 is converted to the target physical address.

In S2702, the same as the processing of S202 of FIG. 19, the disk I/O processor 202 determines if the received I/O request is a write request or a read request. When the received I/O request is a write request (S2702: YES), this processing proceeds to S2703, and when the received I/O request is a read request (S2702: NO), this processing moves to S2709.

In S2703, the disk I/O processor 202 locks the stripes related to the write of the write-targeted data placed in the cache area (for example, the write-destination stripes, such as those of the data units into which the write-targeted data has been divided, and the Q parity and compressed parity corresponding thereto) so as to make these stripes inaccessible to other requests.

In S2704, the disk I/O processor 202 determines whether or not the size of the write-targeted data is greater than the size of the data unit. When the size of the write-targeted data is smaller than the size of the data unit (S2704: NO), this processing proceeds to S2705, and when the size of the write-targeted data is larger than the size of the data unit (S2704: YES), this processing moves to S2708.

In S2705, the disk I/O processor 202 creates a new Q parity from the new data element in the data unit, the old data element corresponding to this new data element, and the old Q parity, and writes this created new Q parity to the cache area. Furthermore, when the old data element and the old Q parity are not stored in the cache area, the disk I/O processor 202 reads out the old data element and the old Q parity from the HDD 16 prior to carrying out this processing.

In S2706, the disk I/O processor 202 creates a new compressed parity from the new data element, old data element and old compressed parity, and writes this created new compressed parity to the cache area. Furthermore, when the old compressed parity is not stored in the cache area, the disk I/O processor 202 reads out the old compressed parity from the HDD 16 prior to carrying out this processing.

In S2707, the disk I/O processor 202 writes the new data element, new Q parity and new compressed parity to the respective target HDD 16 by sending to the respective target HDD 16 the new data element, new Q parity, and new compressed parity write requests, which specify the target physical addresses. Furthermore, the disk I/O processor 202, for example, can simultaneously write all of the post-update new data elements, new Q parities, and new compressed parities that have been written to the cache area to the HDD 16 at a prescribed timing without carrying out this S2707 processing.

In S2708, the disk I/O processor 202 carries out a write process when the write-targeted data is larger than the data unit. This processing will be described in detail below by referring to FIG. 28.

In S2709, the same as the processing of S208 of FIG. 19, the disk I/O processor 202 reads out the read-targeted data from the respective HDD 16, and stores this read-out read-targeted data in the cache area.

FIG. 28 is a flowchart of processing that is equivalent to that of S2708 of FIG. 27.

This processing is implemented in S2708 of FIG. 27.

In S2801, the disk I/O processor 202 uses the respective data units corresponding to the write-targeted data placed in the cache area to create a new Q parity, and writes this created new Q parity to the cache area. Furthermore, according to the configuration of a data unit (for example, when only one data element is updated), a new Q parity can be created via the same processing as that of S2705 of FIG. 27.

In S2802, the disk I/O processor 202 determines whether or not the number of data units stored in the cache area is equal to or greater than a prescribed number (the number of data units corresponding to one compressed parity). When a prescribed number or more of data units have been placed in the cache area (S2802: YES), this processing proceeds to S2803, and when a prescribed number or more of data units have not been placed in the cache area (S2802: NO), this processing moves to S2804.

In S2803, the disk I/O processor 202 creates a plurality of P parities corresponding to a plurality of data units stored in the cache area, creates a new compressed parity that compresses the created plurality of P parities into one parity, and writes this new compressed parity to the cache area. Furthermore, if the number of data units cannot be divided evenly by the number of data units corresponding to the one compressed parity (when a remainder is generated), the disk I/O processor 202 creates a new compressed parity for the surplus data units via the processing described for S2804 and S2805. Or, a new compressed parity can also be created using a data unit to be written anew in the future, or an existing data unit.
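
Per S2803, the compressed parity is the single code obtained by folding the P parities of the prescribed number of data units into one. Assuming that the P parity of each data unit and the compression itself are both exclusive ORs (consistent with the exclusive-OR update described for S2805), a minimal sketch is:

def p_parity(data_unit):
    # Assumed here: the P parity of one data unit is the XOR of its data elements.
    parity = bytearray(len(data_unit[0]))
    for element in data_unit:
        for i, b in enumerate(element):
            parity[i] ^= b
    return bytes(parity)

def compressed_parity(data_units):
    # S2803 sketch: compress the P parities of the prescribed number of data units
    # into a single compressed parity by XORing them together.
    compressed = bytearray(len(data_units[0][0]))
    for unit in data_units:
        for i, b in enumerate(p_parity(unit)):
            compressed[i] ^= b
    return bytes(compressed)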

In S2804, the disk I/O processor 202 reads out the old data element and the old compressed parity from the HDD 16.

In S2805, the same as the processing of S2706 of FIG. 27, the disk I/O processor 202 creates a new compressed parity from the new data element, old data element and old compressed parity by computing the exclusive OR thereof, and writes this created new compressed parity to the cache area.

In S2806, the same as the processing of S2707 of FIG. 27, the disk I/O processor 202 writes the new data element, new Q parity and new compressed parity to the respective target HDD 16 by sending to the respective target HDD 16 the new data element, new Q parity, and new compressed parity write requests, which specify the target physical addresses.

FIG. 29 is a flowchart of restore processing in the fourth data protection mode.

In the explanation of FIG. 29, it is supposed that there is one completely failed HDD in the RAID group, the same as in the case of FIG. 20. In the processing of FIG. 29, the disk I/O processor 202 restores the data elements, Q parities and compressed parities of all the stripes in the completely failed HDD.

The disk I/O processor 202 records in the cache/control memory 14 a count value (described as “count value A” hereinafter), which represents the rank-order number of a row of stripes from the top of the HDD.

In S2901, the disk I/O processor 202 sets the count value A to 0.

In S2902, the disk I/O processor 202 determines whether or not the failed stripe of the completely failed HDD (hereinafter, the completely failed stripe) in the row of stripes specified from the count value A is the specified stripe into which the compressed parity is written. When the completely failed stripe corresponding to count value A is the specified stripe (S2902: YES), this processing moves to S2910. When the completely failed stripe corresponding to count value A is not the specified stripe (S2902: NO), this processing proceeds to S2903.

In S2903, the disk I/O processor 202 reads out the data elements and Q parities (there may not be any Q parities) from all the stripes other than the completely failed stripe in the row of stripes specified from the count value A.

In S2904, the disk I/O processor 202 determines whether or not the read of S2903 was successful (if everything was read out), and when this read succeeded (S2904: YES), this processing proceeds to S2905, and when this read failed (S2904: NO), this processing moves to S2908.

In S2905, the disk I/O processor 202 uses the data elements and Q parity read out in S2903 to create either a data element or a Q parity for the completely failed stripe, and writes this data element or Q parity to a stripe, which is in the same location in the spare HDD 16 as the location of the completely failed stripe corresponding to count value A in the completely failed HDD 16.

In S2906, the disk I/O processor 202 adds 1 to the count value A.

In S2907, the disk I/O processor 202 determines whether or not the count value A is identical to the number of rows of stripes inside the HDD. When the count value A constitutes the number of rows of stripes (S2907: YES), this processing ends, and when the count value A does not constitute the number of rows of stripes (S2907: NO), this processing moves to S2902.

In S2908, the disk I/O processor 202 determines whether or not one of the reads carried out in S2903 was a failed read. If there was one failed read, this processing proceeds to S2909, and if there was more than one failed read, this processing ends in an error.

In S2909, the disk I/O processor 202 restores the data of the completely failed stripe, and the data of the stripe that corresponds to either the data element or the Q parity that failed to be read in S2903. This processing will be described in detail hereinbelow by referring to FIG. 30.

In S2910, the disk I/O processor 202 reads all the data required for creating a compressed parity for the stripe corresponding to the count value A, and creates a compressed parity.

In S2911, the disk I/O processor 202 stores the created compressed parity in the spare HDD 16 in the same location as the stripe corresponding to the count value A.

FIG. 30 is a flowchart of processing that is equivalent to that of S2909 of FIG. 29 in the fourth data protection mode.

This processing is implemented in S2909 of FIG. 29 in the fourth data protection mode.

In S3001, the disk I/O processor 202 reads out the data unit and compressed parity required to create a P parity corresponding to the data unit of the target row of stripes.

In S3002, the disk I/O processor 202 determines whether or not the read of S3001 was successful. When this read succeeded (S3002: YES), this processing proceeds to S3003, and when this read failed (S3002: NO), this processing ends in an error.

In S3003, the disk I/O processor 202 creates a P parity corresponding to the data unit of the target stripe from the data unit and compressed parity read out in S3001 by computing the exclusive OR thereof.
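
If the compressed parity is the exclusive OR of the P parities it covers, the P parity of the target data unit can be isolated in S3003 by cancelling the contributions of the other data units (read in S3001) back out of the compressed parity. A minimal, self-contained sketch under that assumption:

def recover_p_parity(compressed, other_data_units):
    # S3003 sketch: isolate the P parity of the target data unit by cancelling the
    # contributions of the other data units back out of the compressed parity.
    # Each data unit's P parity is assumed to be the XOR of its data elements, so
    # every element of every other data unit is simply folded in.
    target_p = bytearray(compressed)
    for unit in other_data_units:
        for element in unit:
            for i, b in enumerate(element):
                target_p[i] ^= b
    return bytes(target_p)

The two-erasure recovery of S3004 then proceeds with this P parity and the Q parity according to the usual RAID 6 procedure, which is not reproduced in this sketch.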

In S3004, the disk I/O processor 202 restores the two data elements of the two failed stripes from the data element and Q parity read out in S2903, and the P parity created in S3003, using the same procedures as RAID 6.

In S3005, the disk I/O processor 202 respectively writes either the data element or Q parity restored in S3004 to a stripe in the same location of the spare HDD 16 as the location of the completely failed stripe in the completely failed HDD 16, and to a stripe in the same location of a different spare HDD 16 as the location of the partially failed stripe in the partially failed HDD 16. Furthermore, when either the data element or Q parity restored in S3004 was in the partially failed HDD 16, the write of either the restored data element or Q parity can be to the replacement sector of the partially failed HDD instead of to the spare HDD.

The above-described embodiments of the present invention are examples for explaining the present invention, and do not purport to limit the scope of the present invention to these embodiments. The present invention can be put into practice in a variety of other modes without departing from the gist thereof. For example, in the above-cited examples, a stripe that configures a row of stripes corresponds one-to-one with an HDD, but two or more stripes configuring a row of stripes can also correspond to a single HDD. Further, in the fourth data protection mode, a P parity can be recorded for each data unit instead of a Q parity, and a plurality of Q parities instead of a plurality of P parities can be compressed into a single compressed parity and written to a RAID group.

1. A storage system, comprising: a storage group configured by a plurality of storage devices; and a write controller that controls writing to the storage group, the storage group being configured from a plurality of storage sub-groups; wherein the respective storage sub-groups being configured from two or more storage devices of the plurality of storage devices; wherein the plurality of storage sub-groups being configured from a plurality of first type storage sub-groups and a plurality of second type storage sub-groups; wherein the two or more storage devices, which configure the respective second type storage sub-groups, being storage devices that respectively constitute components of the plurality of first type storage sub-groups, and therefore, the respective storage devices that configure the storage group constituting components of both any of the plurality of first type storage sub-groups, and any of the plurality of second type storage sub-groups; wherein respective first type sub-group storage areas, which are respective storage areas of the respective first type storage sub-groups, being configured from a plurality of rows of first type sub-storage areas; wherein the row of first type sub-storage areas spanning the two or more storage devices configuring the first type storage sub-group, and being configured from two or more first type sub-storage areas corresponding to these two or more storage devices; wherein the respective second type sub-group storage areas, which are the respective storage areas of the respective second type storage sub-groups, being configured from a plurality of rows of second type sub-storage areas; wherein the row of second type sub-storage areas spanning the two or more storage devices that configure the second type storage sub-group, and being configured from two or more second type sub-storage areas corresponding to these two or more storage devices; wherein a first type data unit and a second type data unit existing as data units, the data being of a prescribed size, and made up of a plurality of elements; wherein the first type data unit is configured from a plurality of data elements; wherein the second type data unit is configured from one data element of each of the plurality of first type data units; and wherein the write controller: writes (W1) a data set, which comprises a plurality of data elements configuring the first type data unit and a first redundancy code created based on the plurality of data elements, to the row of first type sub-storage areas; and writes (W2) a second redundancy code, which is created based on the second type data unit that is present in the row of second type sub-storage areas, to a free second type sub-storage area in the row of second type sub-storage areas, wherein the respective second type storage sub-groups are RAID groups corresponding to RAID 4.